Revisiting Subgradient Dominance in Robust MDPs: Counterexamples, Hardness, and Sufficient Conditions

Alex Ayoub; Arnob Ghosh; Csaba Szepesv\'ari; Thang D. Chu; Toshinori Kitamura

arxiv: 2604.21177 · v1 · submitted 2026-04-23 · 🧮 math.OC

Revisiting Subgradient Dominance in Robust MDPs: Counterexamples, Hardness, and Sufficient Conditions

Toshinori Kitamura , Arnob Ghosh , Alex Ayoub , Thang D. Chu , Csaba Szepesv\'ari This is my paper

Pith reviewed 2026-05-09 21:59 UTC · model grok-4.3

classification 🧮 math.OC

keywords robust Markov decision processessubgradient dominanceprojected subgradient descentNP-hardnesslocal minimauncertainty setspolicy optimization

0 comments

The pith

Robust MDPs with general compact uncertainty sets do not satisfy subgradient dominance, contrary to prior claims, and optimal policies can be NP-hard to find even when subgradients are computable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that claims of subgradient dominance holding for robust Markov decision processes under arbitrary compact uncertainty sets are incorrect. Counterexamples demonstrate that even uncertainty sets of cardinality two allow the objective to lack this property and admit strict suboptimal local minima. The authors prove NP-hardness of finding epsilon-optimal policies in two settings where subgradients remain efficiently computable: finite transition uncertainty sets and sa-rectangular finite transition sets paired with finite cost uncertainty. They also supply two sufficient conditions for when the dominance property does hold.

Core claim

Existing assertions that RMDPs satisfy subgradient dominance for general compact uncertainty sets are false. Even for uncertainty sets with only two elements, the RMDP objective fails to be subgradient-dominant and can possess suboptimal strict local minima. Finding an epsilon-optimal policy is NP-hard in cases with finite transition uncertainty sets and sa-rectangular finite transition uncertainty with finite cost uncertainty, despite efficient subgradient computation. Subgradient dominance holds when, for each policy, either the worst-case transition kernel or the worst-case action-value function is unique.

What carries the argument

The subgradient dominance property, under which the squared norm of any subgradient lower-bounds the optimality gap of a policy, allowing projected subgradient descent to converge to near-optimal policies in polynomial steps.

If this is right

Projected subgradient descent applied to general RMDPs can converge to policies whose value is bounded away from optimal.
Convergence guarantees previously derived from subgradient dominance no longer apply to broad classes of robust MDPs.
Solving for epsilon-optimal policies remains computationally intractable in the identified finite-uncertainty settings even though subgradients can be evaluated efficiently.
Only the two identified uniqueness conditions restore the dominance property and its associated polynomial convergence for PSD.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Algorithm designers for robust sequential decision problems should check uniqueness of worst-case kernels or values before relying on subgradient methods.
Alternative optimization techniques such as regularized objectives or global search may be needed when uniqueness fails.
Similar counterexamples and hardness results could apply to other robust optimization problems over sequential policies.

Load-bearing premise

That robust MDPs with arbitrary compact uncertainty sets always satisfy subgradient dominance, independent of whether the worst-case transition or value is unique for each policy.

What would settle it

A concrete RMDP example with a two-element uncertainty set whose objective has a strict local minimum whose value is strictly worse than the global optimum, or a polynomial-time algorithm solving an NP-hard instance of epsilon-optimal policy search under finite transition uncertainty.

Figures

Figures reproduced from arXiv: 2604.21177 by Alex Ayoub, Arnob Ghosh, Csaba Szepesv\'ari, Thang D. Chu, Toshinori Kitamura.

**Figure 1.** Figure 1: An RMDP example where PSD fails (left) and the corresponding robust total cost landscape (right). The policy π2, which always selects action a2, is a strictly suboptimal local minimum. Details are provided in Example 1. Example 1 (An RMDP where PSD fails) Consider a deterministic RMDP with three states(s1, s2, s+), two actions (a1, a2), and two transition kernels P = {P1, P2}, as illustrated in [PITH_FULL… view at source ↗

**Figure 2.** Figure 2: An illustration of the transition kernel [PITH_FULL_IMAGE:figures/full_fig_p023_2.png] view at source ↗

**Figure 3.** Figure 3: An illustration of the transition and the m-th cost function for Cm = (xm1∨¬xm2∨xm3 ). Since the costs at other clause states sCi for i ̸= m are set to −(1 − γ) −1 , the worst-case transition kernel for cm is Pm, which transitions the agent to sCm and incurs zero cost. • Each variable state sxn is absorbing; that is, any action taken at sxn transitions to sxn with probability 1. • Transitions at the clause… view at source ↗

**Figure 4.** Figure 4: An s-rectangular RMDP example satisfying the subgradient dominance property but satisfying neither Assumption 11 nor Assumption 12. Left: P1. Right: P2. Example 2: unique worst-case transition, non-unique worst-case action-value. Now consider a cost-robust MDP with two states {s1, s2} and two actions {a1, a2}. The initial state is s1. After taking any action at s1, the agent deterministically transitions … view at source ↗

read the original abstract

Projected subgradient descent (PSD) has gained popularity for solving robust Markov decision processes (RMDPs) because it applies to a broader class of uncertainty sets than traditional dynamic programming. Existing work claims that RMDPs with a general compact uncertainty set satisfy the subgradient dominance property, under which exact PSD converges to an $\varepsilon$-optimal policy in a polynomial number of updates (e.g., Wang et al., 2023). We show that these claims are incorrect. Even when the uncertainty set has cardinality two, the RMDP objective is not subgradient-dominant and can admit suboptimal strict local minima. Moreover, we prove that finding an $\varepsilon$-optimal policy can be NP-hard even in settings where subgradients are efficiently computable: (i) finite transition uncertainty sets and (ii) $sa$-rectangular finite transition uncertainty sets with finite cost uncertainty sets. Finally, we identify two conditions under which RMDPs do satisfy subgradient dominance: when, for each policy, either the worst-case transition kernel or the worst-case action-value function is unique.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper refutes overstated subgradient dominance claims for robust MDPs with explicit counterexamples even for |U|=2 and adds NP-hardness results for finite uncertainty cases.

read the letter

The main thing to know is that earlier work claimed robust MDPs with general compact uncertainty sets satisfy subgradient dominance, allowing projected subgradient descent to converge in polynomial steps. This paper shows those claims do not hold: even with uncertainty sets of cardinality two the objective is not subgradient-dominant and can have strict suboptimal local minima. It also proves that finding an epsilon-optimal policy is NP-hard for finite transition uncertainty sets and for sa-rectangular finite transition plus finite cost sets, even when subgradients remain efficiently computable. Two clean sufficient conditions are given under which dominance does hold, namely when the worst-case kernel or the worst-case Q-function is unique for each policy. The counterexamples are small and explicit, and the hardness reductions are independent of any fitted parameters or circular citations, which makes the negative results concrete and checkable. The sufficient conditions follow directly from the definitions without extra assumptions. The constructions appear free of internal contradictions or unsupported steps based on the outline and stress-test details. One minor limitation is that full verification of the proof reductions would still need the appendix, but nothing in the presented material suggests they are shaky. This work is aimed at researchers who use or analyze projected subgradient methods for robust MDPs and uncertainty-aware planning. Anyone relying on the prior convergence guarantees will want to see these corrections. It deserves a serious referee because the evidence consists of explicit, falsifiable constructions rather than vague claims, and the topic directly affects algorithm choice in the subfield.

Referee Report

0 major / 2 minor

Summary. The paper refutes prior claims that robust MDPs with general compact uncertainty sets satisfy the subgradient dominance property. It shows via explicit counterexamples (even for uncertainty sets of cardinality two) that the RMDP objective need not be subgradient-dominant and can admit suboptimal strict local minima. It further proves NP-hardness of computing an ε-optimal policy in two settings where subgradients remain efficiently computable (finite transition uncertainty sets; sa-rectangular finite transition sets with finite cost sets). Two sufficient conditions for subgradient dominance are identified: uniqueness of the worst-case transition kernel or of the worst-case action-value function for each policy.

Significance. If the counterexamples and reductions hold, the work corrects an important misconception in the literature on first-order methods for robust MDPs. It delineates precise limitations of projected subgradient descent and supplies verifiable sufficient conditions under which the method is guaranteed to succeed, thereby guiding algorithm design for robust sequential decision problems.

minor comments (2)

[Abstract / Introduction] The abstract and introduction would benefit from a brief explicit statement of the two sufficient conditions (unique worst-case kernel vs. unique worst-case Q-function) rather than deferring the precise wording entirely to the later theorem.
[Preliminaries] Notation for the uncertainty set U and the robust objective J(π) should be introduced once in a dedicated preliminary subsection to avoid repeated re-definition across sections.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work and for recommending acceptance. The referee correctly identifies the key contributions: explicit counterexamples showing that RMDPs with compact uncertainty sets (even of cardinality two) need not satisfy subgradient dominance and can have suboptimal strict local minima, NP-hardness results for computing ε-optimal policies in settings where subgradients are efficiently computable, and two verifiable sufficient conditions (uniqueness of the worst-case kernel or worst-case action-value function per policy) under which subgradient dominance holds.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper refutes prior claims on subgradient dominance for general compact uncertainty sets in RMDPs by providing explicit counterexamples (including for |U|=2) that demonstrate the objective is not subgradient-dominant and admits strict suboptimal local minima. It establishes NP-hardness of ε-optimal policy search via independent complexity reductions in settings with tractable subgradient evaluation (finite transition sets and sa-rectangular finite transition+cost sets). The two sufficient conditions (unique worst-case kernel or unique worst-case Q-function per policy) are stated directly from the problem definitions without any fitted parameters, self-referential derivations, or load-bearing self-citations. All steps are self-contained against external benchmarks and falsifiable via the constructions; no derivation reduces by construction to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard MDP assumptions (finite states/actions, compact uncertainty sets) and standard complexity theory; no free parameters are fitted to data and no new entities are postulated.

axioms (2)

domain assumption Finite state and action spaces with compact uncertainty sets
Standard background for defining RMDPs and subgradients
standard math Standard NP-hardness reductions from known hard problems
Used to establish computational hardness in the two settings

pith-pipeline@v0.9.0 · 5511 in / 1454 out tokens · 38801 ms · 2026-05-09T21:59:31.525755+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

[1]

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift . Journal of Machine Learning Research, 22 0 (98): 0 1--76, 2021

work page 2021
[2]

Constrained Markov Decision Processes , volume 7

Eitan Altman. Constrained Markov Decision Processes , volume 7. CRC Press, 1999

work page 1999
[3]

A Unified Analysis of Descent Sequences in Weakly Convex Optimization, Including Convergence Rates for Bundle Methods

Felipe Atenas, Claudia Sagastiz \'a bal, Paulo JS Silva, and Mikhail Solodov. A Unified Analysis of Descent Sequences in Weakly Convex Optimization, Including Convergence Rates for Bundle Methods . SIAM Journal on Optimization, 33 0 (1): 0 89--115, 2023

work page 2023
[4]

Solving Uncertain Markov Decision Processes

J Andrew Bagnell, Andrew Y Ng, and Jeff G Schneider. Solving Uncertain Markov Decision Processes . Technical report, Carnegie Mellon University, 2001

work page 2001
[5]

On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning

Anas Barakat, Souradip Chakraborty, Peihong Yu, Pratap Tokekar, and Amrit Singh Bedi. On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning . In Advances in Neural Information Processing Systems, 2025

work page 2025
[6]

Inverse Reinforcement Learning in Contextual MDPs

Stav Belogolovsky, Philip Korsunsky, Shie Mannor, Chen Tessler, and Tom Zahavy. Inverse Reinforcement Learning in Contextual MDPs . Machine Learning, 110 0 (9): 0 2295--2334, 2021

work page 2021
[7]

Convex optimization theory , volume 1

Dimitri Bertsekas. Convex optimization theory , volume 1. Athena Scientific, 2009

work page 2009
[8]

Your Policy Regularizer is Secretly an Adversary

Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Gr \'e goire Del \'e tang, Markus Kunesch, Shane Legg, and Pedro Ortega. Your Policy Regularizer is Secretly an Adversary . Transactions on Machine Learning Research, 2022

work page 2022
[9]

Robust Reinforcement Learning with General Utility

Ziyi Chen, Yan Wen, Zhengmian Hu, and Heng Huang. Robust Reinforcement Learning with General Utility . In Advances in Neural Information Processing Systems, 2025

work page 2025
[10]

Stochastic Model-Based Minimization of Weakly Convex Functions

Damek Davis and Dmitriy Drusvyatskiy. Stochastic Model-Based Minimization of Weakly Convex Functions . SIAM Journal on Optimization, 29 0 (1): 0 207--239, 2019

work page 2019
[11]

Twice Regularized MDPs and the Equivalence Between Robustness and Regularization

Esther Derman, Matthieu Geist, and Shie Mannor. Twice Regularized MDPs and the Equivalence Between Robustness and Regularization . In Advances in Neural Information Processing Systems, 2021

work page 2021
[12]

Efficiency of Minimizing Compositions of Convex Functions and Smooth Maps

Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of Minimizing Compositions of Convex Functions and Smooth Maps . Mathematical Programming, 178: 0 503--558, 2019

work page 2019
[13]

Efficient Projections onto the L 1-ball for Learning in High Dimensions

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient Projections onto the L 1-ball for Learning in High Dimensions . In International Conference on Machine Learning, 2008

work page 2008
[14]

Diversity is All You Need: Learning Skills without a Reward Function

Benjamin Eysenbach, Julian Ibarz, Abhishek Gupta, and Sergey Levine. Diversity is All You Need: Learning Skills without a Reward Function . In International Conference on Learning Representations, 2019

work page 2019
[15]

Stochastic Optimization under Hidden Convexity

Ilyas Fatkhullin, Niao He, and Yifan Hu. Stochastic Optimization under Hidden Convexity . SIAM Journal on Optimization, 35 0 (4): 0 2544--2571, 2025

work page 2025
[16]

Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization

Uri Gadot, Esther Derman, Navdeep Kumar, Maxence Mohamed Elfatihi, Kfir Levy, and Shie Mannor. Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization . In AAAI Conference on Artificial Intelligence, 2024

work page 2024
[17]

Robust Markov Decision Processes: Beyond Rectangularity

Vineet Goyal and Julien Grand-Clement. Robust Markov Decision Processes: Beyond Rectangularity . Mathematics of Operations Research, 48 0 (1): 0 203--226, 2023

work page 2023
[18]

Scalable First-Order Methods for Robust Mdps

Julien Grand-Cl \'e ment and Christian Kroer. Scalable First-Order Methods for Robust Mdps . In AAAI Conference on Artificial Intelligence, 2021

work page 2021
[19]

Provably Efficient Maximum Entropy Exploration

Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably Efficient Maximum Entropy Exploration . In International Conference on Machine Learning, 2019

work page 2019
[20]

Generative Adversarial Imitation Learning

Jonathan Ho and Stefano Ermon. Generative Adversarial Imitation Learning . In Advances in Neural Information Processing Systems, 2016

work page 2016
[21]

Regularized Policies are Reward Robust

Hisham Husain, Kamil Ciosek, and Ryota Tomioka. Regularized Policies are Reward Robust . In International Conference on Artificial Intelligence and Statistics, 2021

work page 2021
[22]

Robust Dynamic Programming

Garud N Iyengar. Robust Dynamic Programming . Mathematics of Operations Research, 30 0 (2): 0 257--280, 2005

work page 2005
[23]

Provably Efficient Reinforcement Learning with Linear Function Approximation

Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably Efficient Reinforcement Learning with Linear Function Approximation . In Conference on Learning Theory, 2020

work page 2020
[24]

Approximately optimal approximate reinforcement learning

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002

work page 2002
[25]

Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form

Toshinori Kitamura, Tadashi Kozuno, Wataru Kumagai, Kenta Hoshino, Yohei Hosoe, Kazumi Kasaura, Masashi Hamaya, Paavo Parmas, and Yutaka Matsuo. Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form . In International Conference on Learning Representations, 2025

work page 2025
[26]

Policy Gradient for Rectangular Robust Markov Decision Processes

Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy Gradient for Rectangular Robust Markov Decision Processes . In Advances in Neural Information Processing Systems, 2024

work page 2024
[27]

Policy gradient algorithms for robust mdps with non-rectangular uncertainty sets.arXiv preprint arXiv:2305.19004, 2023

Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Policy Gradient Algorithms for Robust MDPs with Non-Rectangular Uncertainty Sets . arXiv preprint arXiv:2305.19004, 2023

work page arXiv 2023
[28]

First-Order Policy Optimization for Robust Markov Decision Process

Yan Li, Guanghui Lan, and Tuo Zhao. First-Order Policy Optimization for Robust Markov Decision Process . arXiv preprint arXiv:2209.10579, 2022

work page arXiv 2022
[29]

Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality

Shaocong Ma, Ziyi Chen, Yi Zhou, and Heng Huang. Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality . Transactions on Machine Learning Research, 2025

work page 2025
[30]

Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty

Shie Mannor, Ofir Mebel, and Huan Xu. Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty . In International Conference on Machine Learning, 2012

work page 2012
[31]

On the Global Convergence Rates of Softmax Policy Gradient Methods

Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the Global Convergence Rates of Softmax Policy Gradient Methods . In International Conference on Machine Learning, 2020

work page 2020
[32]

Robust Control of Markov Decision Processes with Uncertain Transition Matrices

Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices . Operations Research, 53 0 (5): 0 780--798, 2005

work page 2005
[33]

Sequential Decision-Making under Uncertainty: A Robust MDPs Review

Wenfan Ou and Sheng Bi. Sequential Decision-Making under Uncertainty: A Robust MDPs Review . Annals of Operations Research, 353 0 (3): 0 1239--1285, 2025

work page 2025
[34]

Markov Decision Processes: Discrete Stochastic Dynamic Programming

Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming . John Wiley & Sons, Inc., 1994

work page 1994
[35]

On the Moreau Envelope Properties of Weakly Convex Functions

Marien Renaud, Arthur Leclaire, and Nicolas Papadakis. On the Moreau Envelope Properties of Weakly Convex Functions . arXiv preprint arXiv:2509.13960, 2025

work page arXiv 2025
[36]

Variational Analysis , volume 317

R Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis , volume 317. Springer Science & Business Media, 2009

work page 2009
[37]

Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty

Reazul Hasan Russel, Mouhacine Benosman, and Jeroen Van Baar. Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty . arXiv preprint arXiv:2010.04870, 2020

work page arXiv 2010
[38]

On General Minimax Theorems

Maurice Sion. On General Minimax Theorems . Pacific Journal of Mathematics, 8 0 (1): 0 171 -- 176, 1958

work page 1958
[39]

Taguchi, G

G. Taguchi, G. Taguchi, and Asian Productivity Organization. Introduction to Quality Engineering: Designing Quality Into Products and Processes . Asian Productivity Organization, 1986

work page 1986
[40]

Policy Gradient in Robust MDPs with Global Convergence Guarantee

Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy Gradient in Robust MDPs with Global Convergence Guarantee . In International Conference on Machine Learning, 2023

work page 2023
[41]

Policy Gradient for Robust Markov Decision Processes

Qiuhao Wang, Shaohang Xu, Chin Pang Ho, and Marek Petrik. Policy Gradient for Robust Markov Decision Processes . arXiv preprint arXiv:2410.22114, 2024

work page arXiv 2024
[42]

Policy Gradient Method for Robust Reinforcement Learning

Yue Wang and Shaofeng Zou. Policy Gradient Method for Robust Reinforcement Learning . In International Conference on Machine Learning, 2022

work page 2022
[43]

Robust Markov Decision Processes

Wolfram Wiesemann, Daniel Kuhn, and Ber c Rustem. Robust Markov Decision Processes . Mathematics of Operations Research, 38 0 (1): 0 153--183, 2013

work page 2013
[44]

On the Convergence Rates of Policy Gradient Methods

Lin Xiao. On the Convergence Rates of Policy Gradient Methods . Journal of Machine Learning Research, 23 0 (282): 0 1--36, 2022

work page 2022
[45]

Robust Markov Decision Processes without Model Estimation

Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M Jordan, and Zhihua Zhang. Robust Markov Decision Processes without Model Estimation . arXiv preprint arXiv:2302.01248, 2023

work page arXiv 2023
[46]

Reward is Enough for Convex MDPs

Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is Enough for Convex MDPs . In Advances in Neural Information Processing Systems, 2021

work page 2021
[47]

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang. Variational Policy Gradient Method for Reinforcement Learning with General Utilities . In Advances in Neural Information Processing Systems, 2020

work page 2020
[48]

Penalized Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity

Runyu Zhang, Yang Hu, and Na Li. Penalized Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity . In International Conference on Learning Representations, 2024

work page 2024

[1] [1]

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift . Journal of Machine Learning Research, 22 0 (98): 0 1--76, 2021

work page 2021

[2] [2]

Constrained Markov Decision Processes , volume 7

Eitan Altman. Constrained Markov Decision Processes , volume 7. CRC Press, 1999

work page 1999

[3] [3]

A Unified Analysis of Descent Sequences in Weakly Convex Optimization, Including Convergence Rates for Bundle Methods

Felipe Atenas, Claudia Sagastiz \'a bal, Paulo JS Silva, and Mikhail Solodov. A Unified Analysis of Descent Sequences in Weakly Convex Optimization, Including Convergence Rates for Bundle Methods . SIAM Journal on Optimization, 33 0 (1): 0 89--115, 2023

work page 2023

[4] [4]

Solving Uncertain Markov Decision Processes

J Andrew Bagnell, Andrew Y Ng, and Jeff G Schneider. Solving Uncertain Markov Decision Processes . Technical report, Carnegie Mellon University, 2001

work page 2001

[5] [5]

On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning

Anas Barakat, Souradip Chakraborty, Peihong Yu, Pratap Tokekar, and Amrit Singh Bedi. On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning . In Advances in Neural Information Processing Systems, 2025

work page 2025

[6] [6]

Inverse Reinforcement Learning in Contextual MDPs

Stav Belogolovsky, Philip Korsunsky, Shie Mannor, Chen Tessler, and Tom Zahavy. Inverse Reinforcement Learning in Contextual MDPs . Machine Learning, 110 0 (9): 0 2295--2334, 2021

work page 2021

[7] [7]

Convex optimization theory , volume 1

Dimitri Bertsekas. Convex optimization theory , volume 1. Athena Scientific, 2009

work page 2009

[8] [8]

Your Policy Regularizer is Secretly an Adversary

Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Gr \'e goire Del \'e tang, Markus Kunesch, Shane Legg, and Pedro Ortega. Your Policy Regularizer is Secretly an Adversary . Transactions on Machine Learning Research, 2022

work page 2022

[9] [9]

Robust Reinforcement Learning with General Utility

Ziyi Chen, Yan Wen, Zhengmian Hu, and Heng Huang. Robust Reinforcement Learning with General Utility . In Advances in Neural Information Processing Systems, 2025

work page 2025

[10] [10]

Stochastic Model-Based Minimization of Weakly Convex Functions

Damek Davis and Dmitriy Drusvyatskiy. Stochastic Model-Based Minimization of Weakly Convex Functions . SIAM Journal on Optimization, 29 0 (1): 0 207--239, 2019

work page 2019

[11] [11]

Twice Regularized MDPs and the Equivalence Between Robustness and Regularization

Esther Derman, Matthieu Geist, and Shie Mannor. Twice Regularized MDPs and the Equivalence Between Robustness and Regularization . In Advances in Neural Information Processing Systems, 2021

work page 2021

[12] [12]

Efficiency of Minimizing Compositions of Convex Functions and Smooth Maps

Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of Minimizing Compositions of Convex Functions and Smooth Maps . Mathematical Programming, 178: 0 503--558, 2019

work page 2019

[13] [13]

Efficient Projections onto the L 1-ball for Learning in High Dimensions

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient Projections onto the L 1-ball for Learning in High Dimensions . In International Conference on Machine Learning, 2008

work page 2008

[14] [14]

Diversity is All You Need: Learning Skills without a Reward Function

Benjamin Eysenbach, Julian Ibarz, Abhishek Gupta, and Sergey Levine. Diversity is All You Need: Learning Skills without a Reward Function . In International Conference on Learning Representations, 2019

work page 2019

[15] [15]

Stochastic Optimization under Hidden Convexity

Ilyas Fatkhullin, Niao He, and Yifan Hu. Stochastic Optimization under Hidden Convexity . SIAM Journal on Optimization, 35 0 (4): 0 2544--2571, 2025

work page 2025

[16] [16]

Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization

Uri Gadot, Esther Derman, Navdeep Kumar, Maxence Mohamed Elfatihi, Kfir Levy, and Shie Mannor. Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization . In AAAI Conference on Artificial Intelligence, 2024

work page 2024

[17] [17]

Robust Markov Decision Processes: Beyond Rectangularity

Vineet Goyal and Julien Grand-Clement. Robust Markov Decision Processes: Beyond Rectangularity . Mathematics of Operations Research, 48 0 (1): 0 203--226, 2023

work page 2023

[18] [18]

Scalable First-Order Methods for Robust Mdps

Julien Grand-Cl \'e ment and Christian Kroer. Scalable First-Order Methods for Robust Mdps . In AAAI Conference on Artificial Intelligence, 2021

work page 2021

[19] [19]

Provably Efficient Maximum Entropy Exploration

Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably Efficient Maximum Entropy Exploration . In International Conference on Machine Learning, 2019

work page 2019

[20] [20]

Generative Adversarial Imitation Learning

Jonathan Ho and Stefano Ermon. Generative Adversarial Imitation Learning . In Advances in Neural Information Processing Systems, 2016

work page 2016

[21] [21]

Regularized Policies are Reward Robust

Hisham Husain, Kamil Ciosek, and Ryota Tomioka. Regularized Policies are Reward Robust . In International Conference on Artificial Intelligence and Statistics, 2021

work page 2021

[22] [22]

Robust Dynamic Programming

Garud N Iyengar. Robust Dynamic Programming . Mathematics of Operations Research, 30 0 (2): 0 257--280, 2005

work page 2005

[23] [23]

Provably Efficient Reinforcement Learning with Linear Function Approximation

Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably Efficient Reinforcement Learning with Linear Function Approximation . In Conference on Learning Theory, 2020

work page 2020

[24] [24]

Approximately optimal approximate reinforcement learning

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002

work page 2002

[25] [25]

Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form

Toshinori Kitamura, Tadashi Kozuno, Wataru Kumagai, Kenta Hoshino, Yohei Hosoe, Kazumi Kasaura, Masashi Hamaya, Paavo Parmas, and Yutaka Matsuo. Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form . In International Conference on Learning Representations, 2025

work page 2025

[26] [26]

Policy Gradient for Rectangular Robust Markov Decision Processes

Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy Gradient for Rectangular Robust Markov Decision Processes . In Advances in Neural Information Processing Systems, 2024

work page 2024

[27] [27]

Policy gradient algorithms for robust mdps with non-rectangular uncertainty sets.arXiv preprint arXiv:2305.19004, 2023

Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Policy Gradient Algorithms for Robust MDPs with Non-Rectangular Uncertainty Sets . arXiv preprint arXiv:2305.19004, 2023

work page arXiv 2023

[28] [28]

First-Order Policy Optimization for Robust Markov Decision Process

Yan Li, Guanghui Lan, and Tuo Zhao. First-Order Policy Optimization for Robust Markov Decision Process . arXiv preprint arXiv:2209.10579, 2022

work page arXiv 2022

[29] [29]

Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality

Shaocong Ma, Ziyi Chen, Yi Zhou, and Heng Huang. Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality . Transactions on Machine Learning Research, 2025

work page 2025

[30] [30]

Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty

Shie Mannor, Ofir Mebel, and Huan Xu. Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty . In International Conference on Machine Learning, 2012

work page 2012

[31] [31]

On the Global Convergence Rates of Softmax Policy Gradient Methods

Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the Global Convergence Rates of Softmax Policy Gradient Methods . In International Conference on Machine Learning, 2020

work page 2020

[32] [32]

Robust Control of Markov Decision Processes with Uncertain Transition Matrices

Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices . Operations Research, 53 0 (5): 0 780--798, 2005

work page 2005

[33] [33]

Sequential Decision-Making under Uncertainty: A Robust MDPs Review

Wenfan Ou and Sheng Bi. Sequential Decision-Making under Uncertainty: A Robust MDPs Review . Annals of Operations Research, 353 0 (3): 0 1239--1285, 2025

work page 2025

[34] [34]

Markov Decision Processes: Discrete Stochastic Dynamic Programming

Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming . John Wiley & Sons, Inc., 1994

work page 1994

[35] [35]

On the Moreau Envelope Properties of Weakly Convex Functions

Marien Renaud, Arthur Leclaire, and Nicolas Papadakis. On the Moreau Envelope Properties of Weakly Convex Functions . arXiv preprint arXiv:2509.13960, 2025

work page arXiv 2025

[36] [36]

Variational Analysis , volume 317

R Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis , volume 317. Springer Science & Business Media, 2009

work page 2009

[37] [37]

Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty

Reazul Hasan Russel, Mouhacine Benosman, and Jeroen Van Baar. Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty . arXiv preprint arXiv:2010.04870, 2020

work page arXiv 2010

[38] [38]

On General Minimax Theorems

Maurice Sion. On General Minimax Theorems . Pacific Journal of Mathematics, 8 0 (1): 0 171 -- 176, 1958

work page 1958

[39] [39]

Taguchi, G

G. Taguchi, G. Taguchi, and Asian Productivity Organization. Introduction to Quality Engineering: Designing Quality Into Products and Processes . Asian Productivity Organization, 1986

work page 1986

[40] [40]

Policy Gradient in Robust MDPs with Global Convergence Guarantee

Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy Gradient in Robust MDPs with Global Convergence Guarantee . In International Conference on Machine Learning, 2023

work page 2023

[41] [41]

Policy Gradient for Robust Markov Decision Processes

Qiuhao Wang, Shaohang Xu, Chin Pang Ho, and Marek Petrik. Policy Gradient for Robust Markov Decision Processes . arXiv preprint arXiv:2410.22114, 2024

work page arXiv 2024

[42] [42]

Policy Gradient Method for Robust Reinforcement Learning

Yue Wang and Shaofeng Zou. Policy Gradient Method for Robust Reinforcement Learning . In International Conference on Machine Learning, 2022

work page 2022

[43] [43]

Robust Markov Decision Processes

Wolfram Wiesemann, Daniel Kuhn, and Ber c Rustem. Robust Markov Decision Processes . Mathematics of Operations Research, 38 0 (1): 0 153--183, 2013

work page 2013

[44] [44]

On the Convergence Rates of Policy Gradient Methods

Lin Xiao. On the Convergence Rates of Policy Gradient Methods . Journal of Machine Learning Research, 23 0 (282): 0 1--36, 2022

work page 2022

[45] [45]

Robust Markov Decision Processes without Model Estimation

Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M Jordan, and Zhihua Zhang. Robust Markov Decision Processes without Model Estimation . arXiv preprint arXiv:2302.01248, 2023

work page arXiv 2023

[46] [46]

Reward is Enough for Convex MDPs

Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is Enough for Convex MDPs . In Advances in Neural Information Processing Systems, 2021

work page 2021

[47] [47]

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang. Variational Policy Gradient Method for Reinforcement Learning with General Utilities . In Advances in Neural Information Processing Systems, 2020

work page 2020

[48] [48]

Penalized Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity

Runyu Zhang, Yang Hu, and Na Li. Penalized Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity . In International Conference on Learning Representations, 2024

work page 2024