Robust Markov Decision Processes on Continuous State Spaces

Daniel Kuhn; Mengmeng Li; Yan Li; Yifan Hu

arxiv: 2605.28706 · v1 · pith:S36N4V3Knew · submitted 2026-05-27 · 🧮 math.OC

Robust Markov Decision Processes on Continuous State Spaces

Mengmeng Li , Yifan Hu , Daniel Kuhn , Yan Li This is my paper

Pith reviewed 2026-06-29 10:28 UTC · model grok-4.3

classification 🧮 math.OC

keywords robust MDPscontinuous state spacesstochastic first-order methodpolicy evaluationpolicy iterationsample complexityambiguity setsMarkov games

0 comments

The pith

Stochastic first-order method achieves Õ(1/ε²) sample complexity for robust MDPs on continuous state spaces

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce a stochastic first-order method for robust policy evaluation in infinite-horizon robust MDPs on continuous state spaces. The method uses a structured rectangular ambiguity set contained in the convex hull of unknown kernels. It establishes high probability convergence to the robust value function, resulting in an Õ(1/ε²) sample complexity. This is then applied in approximate policy iteration to find an ε-optimal policy with the same complexity. The results are new for continuous state spaces and extend to zero-sum Markov games.

Core claim

We utilize the dynamic formulation of the corresponding robust MDPs, and subsequently introduce a stochastic first-order method for robust policy evaluation. We establish its high probability convergence to the robust value function, which in turn leads to an Õ(1/ε²) sample complexity. This high probability accuracy certificate is then used in an approximate policy iteration method that finds an ε-optimal policy with Õ(1/ε²) samples. The obtained sample complexities for both robust policy evaluation and optimization appear to be new for robust MDPs with continuous state spaces. Of independent interest, the proposed method is also directly applicable to zero-sum Markov games, which seems to s

What carries the argument

Stochastic first-order method applied to the dynamic programming formulation of the robust MDP

If this is right

Robust policy evaluation converges with high probability using Õ(1/ε²) samples.
Approximate policy iteration identifies an ε-optimal policy with Õ(1/ε²) samples.
The method applies to zero-sum Markov games and improves existing sample complexities for continuous state spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The avoidance of discretization suggests the approach could scale directly to higher-dimensional robust control problems.
Similar first-order techniques might transfer to robust optimization problems outside MDPs that share rectangular uncertainty structure.
Empirical tests on benchmark continuous-state problems would check whether the theoretical sample bounds translate to observed performance.

Load-bearing premise

The rectangular ambiguity set lies inside the convex hull of unknown generating kernels so the dynamic program remains tractable for direct application of stochastic first-order methods without discretization.

What would settle it

A specific continuous-state robust MDP where robust policy evaluation requires more than Õ(1/ε²) samples to reach ε-accuracy would contradict the stated high-probability convergence.

read the original abstract

We study infinite-horizon robust Markov decision processes (MDPs) on continuous state spaces with structured rectangular ambiguity set. The proposed ambiguity set falls within the convex hull of unknown generating kernels. We utilize the dynamic formulation of the corresponding robust MDPs, and subsequently introduce a stochastic first-order method for robust policy evaluation. We establish its high probability convergence to the robust value function, which in turn leads to an $\widetilde{\mathcal O}(1/\epsilon^2)$ sample complexity. This high probability accuracy certificate is then used in an approximate policy iteration method that finds an $\epsilon$-optimal policy with $\widetilde{\mathcal O}(1/\epsilon^2)$ samples. The obtained sample complexities for both robust policy evaluation and optimization appear to be new for robust MDPs with continuous state spaces. Of independent interest, the proposed method is also directly applicable to zero-sum Markov games, which seems to strictly improve the existing sample complexities for continuous state spaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims the first explicit high-probability O(1/ε²) sample bounds for robust policy evaluation and optimization on continuous-state MDPs using stochastic first-order methods on a rectangular ambiguity set; the claims look coherent but rest on unshown proofs.

read the letter

The main thing to know is that this paper derives high-probability sample complexities of order 1/ε² for both robust policy evaluation and for finding an ε-optimal policy in infinite-horizon robust MDPs on continuous state spaces. They apply a stochastic first-order method to the dynamic programming formulation under a structured rectangular ambiguity set that lies inside the convex hull of unknown generating kernels, then lift the evaluation certificate into approximate policy iteration. They also note that the same method applies to zero-sum Markov games and improves existing bounds there.

What is actually new is the explicit sample bound for the continuous-state case, which the abstract states has not appeared before for robust MDPs. The route from stochastic first-order evaluation to policy iteration is straightforward and the high-probability guarantee is stated clearly. The game extension is a useful side result. The construction avoids obvious circularity; the bounds come from the stochastic analysis rather than post-hoc fitting.

The soft spots are in the continuous-space handling. The abstract asserts convergence and the sample complexity without showing derivation steps or how the usual technical issues (function approximation, measure-theoretic details, or discretization error) are controlled so they do not inflate the bound. The key modeling assumption that the ambiguity set sits inside the convex hull of generating kernels is central, and it is not obvious how restrictive it is or how it would be checked in applications. These points are not fatal on the surface but need verification.

This is for researchers working on robust reinforcement learning and sample complexity results for MDPs with continuous states. A reader who follows robust RL would get concrete value from the stated bounds if they hold. It deserves a serious referee to check the proofs, the novelty claim against the discrete literature, and whether the continuous-space technicalities are handled cleanly.

Referee Report

1 major / 2 minor

Summary. The paper studies infinite-horizon robust MDPs on continuous state spaces under a structured rectangular ambiguity set contained in the convex hull of unknown generating kernels. It employs the dynamic programming formulation to develop a stochastic first-order method for robust policy evaluation that achieves high-probability convergence to the robust value function with ilde O(1/ε²) sample complexity. This certificate is then lifted to an approximate policy iteration procedure that yields an ε-optimal policy with the same sample complexity. The approach is also shown to apply directly to zero-sum Markov games, improving existing sample complexities for continuous-state settings.

Significance. If the stated high-probability bounds and sample complexities hold, the work would provide the first such guarantees for robust MDPs and games on continuous state spaces, extending beyond the discrete-state literature. The direct application of stochastic first-order methods to the continuous-state dynamic program without explicit discretization is a technical strength, and the rectangular ambiguity set construction enables the claimed tractability.

major comments (1)

[Abstract and §3 (policy evaluation)] The central sample-complexity claims rest on the assumption that the rectangular ambiguity set lies inside the convex hull of the generating kernels and that the resulting robust Bellman operator remains amenable to stochastic first-order analysis without incurring approximation errors that would invalidate the ilde O(1/ε²) bound. This assumption is load-bearing for both the policy-evaluation and policy-iteration results; a concrete verification that the continuous-state technicalities (e.g., measurability and Lipschitz conditions on the kernels) are handled without hidden discretization is required in the proof of Theorem 3.1 or the corresponding section.

minor comments (2)

Notation for the ambiguity set and the stochastic gradient estimator should be introduced with explicit dependence on the continuous state variable to avoid ambiguity when the method is applied to games.
The high-probability statement would benefit from an explicit statement of the failure probability δ and how it enters the ilde O notation.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and for recognizing the potential significance of the first high-probability sample-complexity guarantees for continuous-state robust MDPs. We address the single major comment below and are prepared to revise the manuscript to strengthen the exposition of the technical conditions.

read point-by-point responses

Referee: [Abstract and §3 (policy evaluation)] The central sample-complexity claims rest on the assumption that the rectangular ambiguity set lies inside the convex hull of the generating kernels and that the resulting robust Bellman operator remains amenable to stochastic first-order analysis without incurring approximation errors that would invalidate the ilde O(1/ε²) bound. This assumption is load-bearing for both the policy-evaluation and policy-iteration results; a concrete verification that the continuous-state technicalities (e.g., measurability and Lipschitz conditions on the kernels) are handled without hidden discretization is required in the proof of Theorem 3.1 or the corresponding section.

Authors: We agree that explicit verification of these conditions is important for the continuous-state setting. In the current proof of Theorem 3.1 we start from the dynamic-programming formulation and apply the stochastic first-order method directly to the robust Bellman operator induced by the rectangular ambiguity set contained in the convex hull of the generating kernels. The proof invokes the standing measurability and Lipschitz assumptions stated in Section 2 on the kernels; these ensure that the robust operator remains a contraction mapping with the same modulus as the nominal case and that the stochastic gradients are unbiased with bounded variance, allowing the high-probability convergence analysis to proceed via standard martingale concentration without any discretization step. The same argument carries over to the approximate policy iteration result. To make the verification fully self-contained, we will add a short dedicated paragraph (or subsection) immediately after the statement of Theorem 3.1 that explicitly checks the required measurability, Lipschitz, and unbiased-gradient properties under the rectangular construction and confirms that no auxiliary discretization is introduced. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents a stochastic first-order method for robust policy evaluation on continuous-state robust MDPs, deriving high-probability convergence and the resulting ilde O(1/ε²) sample complexity directly from the method's analysis applied to the dynamic programming formulation. This is then lifted to approximate policy iteration. No step reduces a claimed prediction or result to a fitted input, self-defined quantity, or load-bearing self-citation by construction; the bounds are obtained from standard stochastic optimization convergence arguments without circular reduction to the paper's own inputs or prior fitted constants. The extension to Markov games is stated as a direct applicability without invoking uniqueness theorems or ansatzes from overlapping prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the rectangular structure of the ambiguity set, its representation as the convex hull of unknown kernels, and the validity of the dynamic programming formulation for robust value functions on continuous spaces.

axioms (2)

domain assumption The ambiguity set is structured rectangular and contained in the convex hull of unknown generating kernels.
Stated directly in the abstract as the modeling choice for the robust MDP.
domain assumption The dynamic formulation of the robust MDP admits a stochastic first-order method whose iterates converge to the robust value function.
Invoked when the paper 'utilizes the dynamic formulation' to introduce the stochastic method.

pith-pipeline@v0.9.1-grok · 5689 in / 1569 out tokens · 30134 ms · 2026-06-29T10:28:36.011812+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Policy Gradient for Continuous-Time Robust Markov Decision Processes
cs.LG 2026-06 unverdicted novelty 7.0

Extends robust MDPs to continuous time with policy gradient derivations using differential equation methods and proposes optimizers achieving linear convergence and specific sample complexities.

Reference graph

Works this paper leans on

79 extracted references · 13 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

A natural actor-critic framework for zero-sum Markov games

Ahmet Alacaoglu, Luca Viano, Niao He, and Volkan Cevher. A natural actor-critic framework for zero-sum Markov games. InInternational Conference on Machine Learning, pages 307–366, 2022

2022
[2]

Springer, 2006

Charalambos D Aliprantis and Kim C Border.Infinite Dimensional Analysis: A Hitchhiker’s Guide. Springer, 2006

2006
[3]

A finite time analysis of temporal difference learning with linear function approximation

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on Learning Theory, pages 1691–1692, 2018

2018
[4]

Multi-period trading via convex optimization.Foundations and Trends®in Optimization, 3(1):1–76, 2017

Stephen Boyd, Enzo Busseti, Steve Diamond, Ronald N Kahn, Kwangmoo Koh, Peter Nystrup, and Jan Speth. Multi-period trading via convex optimization.Foundations and Trends®in Optimization, 3(1):1–76, 2017

2017
[5]

Faster last-iterate convergence of policy optimiza- tion in zero-sum Markov games.arXiv preprint arXiv:2210.01050, 2022

Shicong Cen, Yuejie Chi, Simon S Du, and Lin Xiao. Faster last-iterate convergence of policy optimiza- tion in zero-sum Markov games.arXiv preprint arXiv:2210.01050, 2022

work page arXiv 2022
[6]

A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011

Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011

2011
[7]

A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games. InNeural Information Processing Systems, volume 36, pages 75826–75883, 2023

2023
[8]

Almost optimal algorithms for two-player zero-sum linear mixture Markov games

Zixiang Chen, Dongruo Zhou, and Quanquan Gu. Almost optimal algorithms for two-player zero-sum linear mixture Markov games. InInternational Conference on Algorithmic Learning Theory, pages 227–261, 2022

2022
[9]

Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016

Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016

2016
[10]

Cambridge Uni- versity Press, 2014

Giuseppe Da Prato and Jerzy Zabczyk.Stochastic Equations in Infinite Dimensions. Cambridge Uni- versity Press, 2014

2014
[11]

From low probability to high confi- dence in stochastic convex optimization.Journal of machine learning research, 22(49):1–38, 2021

Damek Davis, Dmitriy Drusvyatskiy, Lin Xiao, and Junyu Zhang. From low probability to high confi- dence in stochastic convex optimization.Journal of machine learning research, 22(49):1–38, 2021

2021
[12]

Robust Markov decision process: Beyond rectangularity

Vineet Goyal and Julien Grand-Clement. Robust Markov decision process: Beyond rectangularity. Mathematics of Operations Research, 48(1):203–226, 2022

2022
[13]

Partial policy iteration forl 1-robust Markov decision processes.Journal of Machine Learning Research, 22(1):12612–12657, 2021

Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Partial policy iteration forl 1-robust Markov decision processes.Journal of Machine Learning Research, 22(1):12612–12657, 2021

2021
[14]

Loss minimization and parameter estimation with heavy tails.Journal of Machine Learning Research, 17(18):1–40, 2016

Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails.Journal of Machine Learning Research, 17(18):1–40, 2016

2016
[15]

Does distributionally robust supervised learning give robust classifiers? InInternational Conference on Machine Learning, pages 2029–2037, 2018

Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers? InInternational Conference on Machine Learning, pages 2029–2037, 2018

2029
[16]

Towards general function approximation in zero-sum Markov games.arXiv preprint arXiv:2107.14702, 2021

Baihe Huang, Jason D Lee, Zhaoran Wang, and Zhuoran Yang. Towards general function approximation in zero-sum Markov games.arXiv preprint arXiv:2107.14702, 2021. 26

work page arXiv 2021
[17]

OR-Gym: A reinforcement learning library for operations research problems.arXiv preprint arXiv:2008.06319, 2020

Christian D Hubbs, Hector D Perez, Owais Sarwar, Nikolaos V Sahinidis, Ignacio E Grossmann, and John M Wassick. OR-Gym: A reinforcement learning library for operations research problems.arXiv preprint arXiv:2008.06319, 2020

work page arXiv 2008
[18]

Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

Garud Iyengar. Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

2005
[19]

The power of exploiter: Provable multi-agent RL in large state spaces

Chi Jin, Qinghua Liu, and Tiancheng Yu. The power of exploiter: Provable multi-agent RL in large state spaces. InInternational Conference on Machine Learning, pages 10251–10279, 2022

2022
[20]

Policy optimization over general state and action spaces.arXiv preprint arXiv:2211.16715,

Caleb Ju and Guanghui Lan. Policy optimization over general state and action spaces.arXiv preprint arXiv:2211.16715, 2022

work page arXiv 2022
[21]

Approximately optimal approximate reinforcement learning

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274, 2002

2002
[22]

Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and optimal methods for stochastic variational inequalities, ii: Markovian noise and policy evaluation in reinforcement learning.SIAM Journal on Optimization, 32(2):1120–1155, 2022

2022
[23]

Policy gradient for rectangular robust Markov decision processes

Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy gradient for rectangular robust Markov decision processes. InNeural Information Processing Systems, pages 59477–59501, 2023

2023
[24]

Dual formulation for non-rectangularL p robust Markov decision processes.arXiv preprint arXiv:2502.09432, 2025

Navdeep Kumar, Adarsh Gupta, Maxence Mohamed Elfatihi, Giorgia Ramponi, Kfir Yehuda Levy, and Shie Mannor. Dual formulation for non-rectangularL p robust Markov decision processes.arXiv preprint arXiv:2502.09432, 2025

work page arXiv 2025
[25]

Efficient value iteration for s- rectangular robust Markov decision processes

Navdeep Kumar, Kaixin Wang, Kfir Yehuda Levy, and Shie Mannor. Efficient value iteration for s- rectangular robust Markov decision processes. InInternational Conference on Machine Learning, pages 25682–25725, 2024

2024
[26]

A novel catalyst scheme for stochastic minimax optimization.arXiv preprint arXiv:2311.02814, 2023

Guanghui Lan and Yan Li. A novel catalyst scheme for stochastic minimax optimization.arXiv preprint arXiv:2311.02814, 2023

work page arXiv 2023
[27]

PhD thesis, Massachusetts Institute of Technology, 2007

Yann Le Tallec.Robust, Risk-Sensitive, and Data-Driven Control of Markov Decision Processes. PhD thesis, Massachusetts Institute of Technology, 2007

2007
[28]

Minimax-optimal multi-agent RL in Markov games with a generative model

Gen Li, Yuejie Chi, Yuting Wei, and Yuxin Chen. Minimax-optimal multi-agent RL in Markov games with a generative model. InNeural Information Processing Systems, pages 15353–15367, 2022

2022
[29]

Policy gradient algorithms for robust MDPs with non-rectangular uncertainty sets.arXiv preprint arXiv:2305.19004, 2023

Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Policy gradient algorithms for robust MDPs with non-rectangular uncertainty sets.arXiv preprint arXiv:2305.19004, 2023

work page arXiv 2023
[30]

Towards optimal offline reinforcement learning.arXiv preprint arXiv:2503.12283, 2025

Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Towards optimal offline reinforcement learning.arXiv preprint arXiv:2503.12283, 2025

work page arXiv 2025
[31]

Sample Average Approximation for Distributionally Robust Optimization with $\phi$-divergences

Yan Li. Sample average approximation for distributionally robust optimization withϕ-divergences. arXiv preprint arXiv:2604.10855, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

First-order policy optimization for robust policy evaluation.arXiv preprint arXiv:2307.15890, 2023

Yan Li and Guanghui Lan. First-order policy optimization for robust policy evaluation.arXiv preprint arXiv:2307.15890, 2023

work page arXiv 2023
[33]

Rectangularity and duality of distributionally robust Markov decision processes.Mathematical Programming, 2025

Yan Li and Alexander Shapiro. Rectangularity and duality of distributionally robust Markov decision processes.Mathematical Programming, 2025

2025
[34]

First-order policy optimization for robust Markov decision process.arXiv preprint arXiv:2209.10579, 2022

Yan Li, Tuo Zhao, and Guanghui Lan. First-order policy optimization for robust Markov decision process.arXiv preprint arXiv:2209.10579, 2022

work page arXiv 2022
[35]

Near-Optimal Sample Complexities of Divergence-based S-rectangular Distributionally Robust Reinforcement Learning

Zhenghao Li, Shengbo Wang, and Nian Si. Near-optimal sample complexities of divergence-based s- rectangular distributionally robust reinforcement learning.arXiv preprint arXiv:2505.12202, 2025. 27

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Reinforcement learning for robust parameterized locomotion control of bipedal robots

Zhongyu Li, Xuxin Cheng, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, and Koushil Sreenath. Reinforcement learning for robust parameterized locomotion control of bipedal robots. In IEEE International Conference on Robotics and Automation, pages 2811–2817, 2021

2021
[37]

A single-loop robust policy gradient method for robust Markov decision processes

Zhenwei Lin, Chenyu Xue, Qi Deng, and Yinyu Ye. A single-loop robust policy gradient method for robust Markov decision processes. InInternational Conference on Machine Learning, pages 30392 – 30426, 2024

2024
[38]

Distributionally robustQ-learning

Zijian Liu, Qinxun Bai, Jose Blanchet, Perry Dong, Wei Xu, Zhengqing Zhou, and Zhengyuan Zhou. Distributionally robustQ-learning. InInternational Conference on Machine Learning, pages 13623– 13643, 2022

2022
[39]

Finite-time bounds for fitted value iteration.Journal of Machine Learning Research, 9(27):815–857, 2008

R´ emi Munos and Csaba Szepesv´ ari. Finite-time bounds for fitted value iteration.Journal of Machine Learning Research, 9(27):815–857, 2008

2008
[40]

Arkadi Nemirovski. Prox-method with rate of convergenceO(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems.SIAM Journal on Optimization, 15(1):229–251, 2004

2004
[41]

Problem complexity and method efficiency in optimization

Arkadij Semenoviˇ c Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983

1983
[42]

Smooth minimization of non-smooth functions.Mathematical Programming, 103(1):127– 152, 2005

Yu Nesterov. Smooth minimization of non-smooth functions.Mathematical Programming, 103(1):127– 152, 2005

2005
[43]

Primal-dual subgradient methods for convex problems.Mathematical programming, 120(1):221–259, 2009

Yurii Nesterov. Primal-dual subgradient methods for convex problems.Mathematical programming, 120(1):221–259, 2009

2009
[44]

Markov decision processes under model uncertainty

Ariel Neufeld, Julian Sester, and Mario ˇSiki´ c. Markov decision processes under model uncertainty. Mathematical Finance, 33(3):618–665, 2023

2023
[45]

Robust control of Markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

2005
[46]

Sample complexity of robust reinforcement learning with a generative model

Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. InInternational Conference on Artificial Intelligence and Statistics, pages 9582– 9602, 2022

2022
[47]

Robust reinforcement learning using offline data

Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Robust reinforcement learning using offline data. InNeural Information Processing Systems, pages 32211–32224, 2022

2022
[48]

Patek.Stochastic Shortest Path Games: Theory and Algorithms

Stephen D. Patek.Stochastic Shortest Path Games: Theory and Algorithms. PhD thesis, Massachusetts Institute of Technology, 1997

1997
[49]

Sim-to-real transfer of robotic control with dynamics randomization

Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. InIEEE International Conference on Robotics and Automation, pages 3803–3810, 2018

2018
[50]

Approximate dynamic programming for two-player zero-sum Markov games

Julien Perolat, Bruno Scherrer, Bilal Piot, and Olivier Pietquin. Approximate dynamic programming for two-player zero-sum Markov games. InInternational Conference on Machine Learning, pages 1321– 1329, 2015

2015
[51]

Random features for large-scale kernel machines

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. InNeural Information Processing Systems, 2007

2007
[52]

Distribu- tionally robust model-based reinforcement learning with large state spaces

Shyam Sundhar Ramesh, Pier Giuseppe Sessa, Yifan Hu, Andreas Krause, and Ilija Bogunovic. Distribu- tionally robust model-based reinforcement learning with large state spaces. InInternational Conference on Artificial Intelligence and Statistics, pages 100–108, 2024. 28

2024
[53]

Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017

Alexander Shapiro. Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017

2017
[54]

The curious price of distributional robustness in reinforcement learning with a generative model

Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Matthieu Geist, and Yuejie Chi. The curious price of distributional robustness in reinforcement learning with a generative model. InNeural Information Processing Systems, pages 79903–79917, 2023

2023
[55]

Springer, 2009

Bruno Siciliano, Lorenzo Sciavicco, Luigi Villani, and Giuseppe Oriolo.Robotics: Modelling, Planning and Control. Springer, 2009

2009
[56]

Near-optimal time and sample complexities for solving Markov decision processes with a generative model

Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. InNeural Information Processing Systems, 2018

2018
[57]

Can we find Nash equilibria at a linear rate in Markov games?arXiv preprint arXiv:2303.03095, 2023

Zhuoqing Song, Jason D Lee, and Zhuoran Yang. Can we find Nash equilibria at a linear rate in Markov games?arXiv preprint arXiv:2303.03095, 2023

work page arXiv 2023
[58]

Learning to predict by the methods of temporal differences.Machine Learning, 3(1):9–44, 1988

Richard S Sutton. Learning to predict by the methods of temporal differences.Machine Learning, 3(1):9–44, 1988

1988
[59]

Dynamic control of autonomous quadrotor flight in an estimated wind field

Nitin Sydney, Brendan Smyth, and Derek A Paley. Dynamic control of autonomous quadrotor flight in an estimated wind field. In52nd IEEE Conference on Decision and Control, pages 3609–3616. IEEE, 2013

2013
[60]

Scaling up robust MDPs using function approximation

Aviv Tamar, Shie Mannor, and Huan Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pages 181–189, 2014

2014
[61]

Domain randomization for transferring deep neural networks from simulation to the real world

Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 23–30, 2017

2017
[62]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012

2012
[63]

Analysis of temporal-diffference learning with function approx- imation.Advances in neural information processing systems, 9, 1996

John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-diffference learning with function approx- imation.Advances in neural information processing systems, 9, 1996

1996
[64]

Springer, 2008

Alexandre B Tsybakov.Introduction to Nonparametric Estimation. Springer, 2008

2008
[65]

Mesures dans les espaces produits.Atti Accad

CT Ionescu Tulcea. Mesures dans les espaces produits.Atti Accad. Naz. Lincei Rend, 7:208–211, 1949

1949
[66]

Discounted Markov games: Generalized policy iteration method.Journal of Opti- mization Theory and Applications, 25(1):125–138, 1978

Jan Van der Wal. Discounted Markov games: Generalized policy iteration method.Journal of Opti- mization Theory and Applications, 25(1):125–138, 1978

1978
[67]

Cambridge Univer- sity Press, 2019

Martin J Wainwright.High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Univer- sity Press, 2019

2019
[68]

Policy gradient in robust MDPs with global conver- gence guarantee

Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy gradient in robust MDPs with global conver- gence guarantee. InInternational Conference on Machine Learning, pages 35763–35797, 2023

2023
[69]

A finite sample complexity bound for distributionally robustQ-learning

Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. A finite sample complexity bound for distributionally robustQ-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 3370–3398, 2023

2023
[70]

Policy gradient method for robust reinforcement learning

Yue Wang and Shaofeng Zou. Policy gradient method for robust reinforcement learning. InInternational Conference on Machine Learning, pages 23484–23526, 2022

2022
[71]

Robust Markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013

Wolfram Wiesemann, Daniel Kuhn, and Ber¸ c Rustem. Robust Markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013. 29

2013
[72]

Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium

Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. InConference on Learning Theory, pages 3674–3682, 2020

2020
[73]

Improved sample complexity bounds for distri- butionally robust reinforcement learning

Zaiyan Xu, Kishan Panaganti, and Dileep Kalathil. Improved sample complexity bounds for distri- butionally robust reinforcement learning. InInternational Conference on Artificial Intelligence and Statistics, pages 9728–9754, 2023

2023
[74]

Model-based reinforcement learning for offline zero-sum Markov games.Operations Research, 72(6):2430–2445, 2024

Yuling Yan, Gen Li, Yuxin Chen, and Jianqing Fan. Model-based reinforcement learning for offline zero-sum Markov games.Operations Research, 72(6):2430–2445, 2024

2024
[75]

Rates of convergence for empirical processes of stationary mixing sequences.The Annals of Probability, 22(1):94–116, 1994

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences.The Annals of Probability, 22(1):94–116, 1994

1994
[76]

Regularized gradient descent ascent for two-player zero-sum Markov games

Sihan Zeng, Thinh Doan, and Justin Romberg. Regularized gradient descent ascent for two-player zero-sum Markov games. InNeural Information Processing Systems, pages 34546–34558, 2022

2022
[77]

Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity.Journal of Machine Learning Research, 24(175):1–53, 2023

Kaiqing Zhang, Sham M Kakade, Tamer Basar, and Lin F Yang. Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity.Journal of Machine Learning Research, 24(175):1–53, 2023

2023
[78]

Provably efficient policy optimization for two- player zero-sum Markov games

Yulai Zhao, Yuandong Tian, Jason Lee, and Simon Du. Provably efficient policy optimization for two- player zero-sum Markov games. InInternational Conference on Artificial Intelligence and Statistics, pages 2736–2761, 2022

2022
[79]

Natural actor-critic for robust reinforcement learning with function approximation

Ruida Zhou, Tao Liu, Min Cheng, Dileep Kalathil, PR Kumar, and Chao Tian. Natural actor-critic for robust reinforcement learning with function approximation. InNeural Information Processing Systems, pages 97–133, 2023. A Appendix We introduce the following non-robust Bellman evaluation operator, namely for allv∈C b(S) ands∈ S, T P π v(s) = X a∈A π(a|s) r(...

2023

[1] [1]

A natural actor-critic framework for zero-sum Markov games

Ahmet Alacaoglu, Luca Viano, Niao He, and Volkan Cevher. A natural actor-critic framework for zero-sum Markov games. InInternational Conference on Machine Learning, pages 307–366, 2022

2022

[2] [2]

Springer, 2006

Charalambos D Aliprantis and Kim C Border.Infinite Dimensional Analysis: A Hitchhiker’s Guide. Springer, 2006

2006

[3] [3]

A finite time analysis of temporal difference learning with linear function approximation

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on Learning Theory, pages 1691–1692, 2018

2018

[4] [4]

Multi-period trading via convex optimization.Foundations and Trends®in Optimization, 3(1):1–76, 2017

Stephen Boyd, Enzo Busseti, Steve Diamond, Ronald N Kahn, Kwangmoo Koh, Peter Nystrup, and Jan Speth. Multi-period trading via convex optimization.Foundations and Trends®in Optimization, 3(1):1–76, 2017

2017

[5] [5]

Faster last-iterate convergence of policy optimiza- tion in zero-sum Markov games.arXiv preprint arXiv:2210.01050, 2022

Shicong Cen, Yuejie Chi, Simon S Du, and Lin Xiao. Faster last-iterate convergence of policy optimiza- tion in zero-sum Markov games.arXiv preprint arXiv:2210.01050, 2022

work page arXiv 2022

[6] [6]

A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011

Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011

2011

[7] [7]

A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games. InNeural Information Processing Systems, volume 36, pages 75826–75883, 2023

2023

[8] [8]

Almost optimal algorithms for two-player zero-sum linear mixture Markov games

Zixiang Chen, Dongruo Zhou, and Quanquan Gu. Almost optimal algorithms for two-player zero-sum linear mixture Markov games. InInternational Conference on Algorithmic Learning Theory, pages 227–261, 2022

2022

[9] [9]

Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016

Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016

2016

[10] [10]

Cambridge Uni- versity Press, 2014

Giuseppe Da Prato and Jerzy Zabczyk.Stochastic Equations in Infinite Dimensions. Cambridge Uni- versity Press, 2014

2014

[11] [11]

From low probability to high confi- dence in stochastic convex optimization.Journal of machine learning research, 22(49):1–38, 2021

Damek Davis, Dmitriy Drusvyatskiy, Lin Xiao, and Junyu Zhang. From low probability to high confi- dence in stochastic convex optimization.Journal of machine learning research, 22(49):1–38, 2021

2021

[12] [12]

Robust Markov decision process: Beyond rectangularity

Vineet Goyal and Julien Grand-Clement. Robust Markov decision process: Beyond rectangularity. Mathematics of Operations Research, 48(1):203–226, 2022

2022

[13] [13]

Partial policy iteration forl 1-robust Markov decision processes.Journal of Machine Learning Research, 22(1):12612–12657, 2021

Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Partial policy iteration forl 1-robust Markov decision processes.Journal of Machine Learning Research, 22(1):12612–12657, 2021

2021

[14] [14]

Loss minimization and parameter estimation with heavy tails.Journal of Machine Learning Research, 17(18):1–40, 2016

Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails.Journal of Machine Learning Research, 17(18):1–40, 2016

2016

[15] [15]

Does distributionally robust supervised learning give robust classifiers? InInternational Conference on Machine Learning, pages 2029–2037, 2018

Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers? InInternational Conference on Machine Learning, pages 2029–2037, 2018

2029

[16] [16]

Towards general function approximation in zero-sum Markov games.arXiv preprint arXiv:2107.14702, 2021

Baihe Huang, Jason D Lee, Zhaoran Wang, and Zhuoran Yang. Towards general function approximation in zero-sum Markov games.arXiv preprint arXiv:2107.14702, 2021. 26

work page arXiv 2021

[17] [17]

OR-Gym: A reinforcement learning library for operations research problems.arXiv preprint arXiv:2008.06319, 2020

Christian D Hubbs, Hector D Perez, Owais Sarwar, Nikolaos V Sahinidis, Ignacio E Grossmann, and John M Wassick. OR-Gym: A reinforcement learning library for operations research problems.arXiv preprint arXiv:2008.06319, 2020

work page arXiv 2008

[18] [18]

Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

Garud Iyengar. Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

2005

[19] [19]

The power of exploiter: Provable multi-agent RL in large state spaces

Chi Jin, Qinghua Liu, and Tiancheng Yu. The power of exploiter: Provable multi-agent RL in large state spaces. InInternational Conference on Machine Learning, pages 10251–10279, 2022

2022

[20] [20]

Policy optimization over general state and action spaces.arXiv preprint arXiv:2211.16715,

Caleb Ju and Guanghui Lan. Policy optimization over general state and action spaces.arXiv preprint arXiv:2211.16715, 2022

work page arXiv 2022

[21] [21]

Approximately optimal approximate reinforcement learning

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274, 2002

2002

[22] [22]

Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and optimal methods for stochastic variational inequalities, ii: Markovian noise and policy evaluation in reinforcement learning.SIAM Journal on Optimization, 32(2):1120–1155, 2022

2022

[23] [23]

Policy gradient for rectangular robust Markov decision processes

Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy gradient for rectangular robust Markov decision processes. InNeural Information Processing Systems, pages 59477–59501, 2023

2023

[24] [24]

Dual formulation for non-rectangularL p robust Markov decision processes.arXiv preprint arXiv:2502.09432, 2025

Navdeep Kumar, Adarsh Gupta, Maxence Mohamed Elfatihi, Giorgia Ramponi, Kfir Yehuda Levy, and Shie Mannor. Dual formulation for non-rectangularL p robust Markov decision processes.arXiv preprint arXiv:2502.09432, 2025

work page arXiv 2025

[25] [25]

Efficient value iteration for s- rectangular robust Markov decision processes

Navdeep Kumar, Kaixin Wang, Kfir Yehuda Levy, and Shie Mannor. Efficient value iteration for s- rectangular robust Markov decision processes. InInternational Conference on Machine Learning, pages 25682–25725, 2024

2024

[26] [26]

A novel catalyst scheme for stochastic minimax optimization.arXiv preprint arXiv:2311.02814, 2023

Guanghui Lan and Yan Li. A novel catalyst scheme for stochastic minimax optimization.arXiv preprint arXiv:2311.02814, 2023

work page arXiv 2023

[27] [27]

PhD thesis, Massachusetts Institute of Technology, 2007

Yann Le Tallec.Robust, Risk-Sensitive, and Data-Driven Control of Markov Decision Processes. PhD thesis, Massachusetts Institute of Technology, 2007

2007

[28] [28]

Minimax-optimal multi-agent RL in Markov games with a generative model

Gen Li, Yuejie Chi, Yuting Wei, and Yuxin Chen. Minimax-optimal multi-agent RL in Markov games with a generative model. InNeural Information Processing Systems, pages 15353–15367, 2022

2022

[29] [29]

Policy gradient algorithms for robust MDPs with non-rectangular uncertainty sets.arXiv preprint arXiv:2305.19004, 2023

Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Policy gradient algorithms for robust MDPs with non-rectangular uncertainty sets.arXiv preprint arXiv:2305.19004, 2023

work page arXiv 2023

[30] [30]

Towards optimal offline reinforcement learning.arXiv preprint arXiv:2503.12283, 2025

Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Towards optimal offline reinforcement learning.arXiv preprint arXiv:2503.12283, 2025

work page arXiv 2025

[31] [31]

Sample Average Approximation for Distributionally Robust Optimization with $\phi$-divergences

Yan Li. Sample average approximation for distributionally robust optimization withϕ-divergences. arXiv preprint arXiv:2604.10855, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[32] [32]

First-order policy optimization for robust policy evaluation.arXiv preprint arXiv:2307.15890, 2023

Yan Li and Guanghui Lan. First-order policy optimization for robust policy evaluation.arXiv preprint arXiv:2307.15890, 2023

work page arXiv 2023

[33] [33]

Rectangularity and duality of distributionally robust Markov decision processes.Mathematical Programming, 2025

Yan Li and Alexander Shapiro. Rectangularity and duality of distributionally robust Markov decision processes.Mathematical Programming, 2025

2025

[34] [34]

First-order policy optimization for robust Markov decision process.arXiv preprint arXiv:2209.10579, 2022

Yan Li, Tuo Zhao, and Guanghui Lan. First-order policy optimization for robust Markov decision process.arXiv preprint arXiv:2209.10579, 2022

work page arXiv 2022

[35] [35]

Near-Optimal Sample Complexities of Divergence-based S-rectangular Distributionally Robust Reinforcement Learning

Zhenghao Li, Shengbo Wang, and Nian Si. Near-optimal sample complexities of divergence-based s- rectangular distributionally robust reinforcement learning.arXiv preprint arXiv:2505.12202, 2025. 27

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Reinforcement learning for robust parameterized locomotion control of bipedal robots

Zhongyu Li, Xuxin Cheng, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, and Koushil Sreenath. Reinforcement learning for robust parameterized locomotion control of bipedal robots. In IEEE International Conference on Robotics and Automation, pages 2811–2817, 2021

2021

[37] [37]

A single-loop robust policy gradient method for robust Markov decision processes

Zhenwei Lin, Chenyu Xue, Qi Deng, and Yinyu Ye. A single-loop robust policy gradient method for robust Markov decision processes. InInternational Conference on Machine Learning, pages 30392 – 30426, 2024

2024

[38] [38]

Distributionally robustQ-learning

Zijian Liu, Qinxun Bai, Jose Blanchet, Perry Dong, Wei Xu, Zhengqing Zhou, and Zhengyuan Zhou. Distributionally robustQ-learning. InInternational Conference on Machine Learning, pages 13623– 13643, 2022

2022

[39] [39]

Finite-time bounds for fitted value iteration.Journal of Machine Learning Research, 9(27):815–857, 2008

R´ emi Munos and Csaba Szepesv´ ari. Finite-time bounds for fitted value iteration.Journal of Machine Learning Research, 9(27):815–857, 2008

2008

[40] [40]

Arkadi Nemirovski. Prox-method with rate of convergenceO(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems.SIAM Journal on Optimization, 15(1):229–251, 2004

2004

[41] [41]

Problem complexity and method efficiency in optimization

Arkadij Semenoviˇ c Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983

1983

[42] [42]

Smooth minimization of non-smooth functions.Mathematical Programming, 103(1):127– 152, 2005

Yu Nesterov. Smooth minimization of non-smooth functions.Mathematical Programming, 103(1):127– 152, 2005

2005

[43] [43]

Primal-dual subgradient methods for convex problems.Mathematical programming, 120(1):221–259, 2009

Yurii Nesterov. Primal-dual subgradient methods for convex problems.Mathematical programming, 120(1):221–259, 2009

2009

[44] [44]

Markov decision processes under model uncertainty

Ariel Neufeld, Julian Sester, and Mario ˇSiki´ c. Markov decision processes under model uncertainty. Mathematical Finance, 33(3):618–665, 2023

2023

[45] [45]

Robust control of Markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

2005

[46] [46]

Sample complexity of robust reinforcement learning with a generative model

Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. InInternational Conference on Artificial Intelligence and Statistics, pages 9582– 9602, 2022

2022

[47] [47]

Robust reinforcement learning using offline data

Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Robust reinforcement learning using offline data. InNeural Information Processing Systems, pages 32211–32224, 2022

2022

[48] [48]

Patek.Stochastic Shortest Path Games: Theory and Algorithms

Stephen D. Patek.Stochastic Shortest Path Games: Theory and Algorithms. PhD thesis, Massachusetts Institute of Technology, 1997

1997

[49] [49]

Sim-to-real transfer of robotic control with dynamics randomization

Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. InIEEE International Conference on Robotics and Automation, pages 3803–3810, 2018

2018

[50] [50]

Approximate dynamic programming for two-player zero-sum Markov games

Julien Perolat, Bruno Scherrer, Bilal Piot, and Olivier Pietquin. Approximate dynamic programming for two-player zero-sum Markov games. InInternational Conference on Machine Learning, pages 1321– 1329, 2015

2015

[51] [51]

Random features for large-scale kernel machines

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. InNeural Information Processing Systems, 2007

2007

[52] [52]

Distribu- tionally robust model-based reinforcement learning with large state spaces

Shyam Sundhar Ramesh, Pier Giuseppe Sessa, Yifan Hu, Andreas Krause, and Ilija Bogunovic. Distribu- tionally robust model-based reinforcement learning with large state spaces. InInternational Conference on Artificial Intelligence and Statistics, pages 100–108, 2024. 28

2024

[53] [53]

Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017

Alexander Shapiro. Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017

2017

[54] [54]

The curious price of distributional robustness in reinforcement learning with a generative model

Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Matthieu Geist, and Yuejie Chi. The curious price of distributional robustness in reinforcement learning with a generative model. InNeural Information Processing Systems, pages 79903–79917, 2023

2023

[55] [55]

Springer, 2009

Bruno Siciliano, Lorenzo Sciavicco, Luigi Villani, and Giuseppe Oriolo.Robotics: Modelling, Planning and Control. Springer, 2009

2009

[56] [56]

Near-optimal time and sample complexities for solving Markov decision processes with a generative model

Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. InNeural Information Processing Systems, 2018

2018

[57] [57]

Can we find Nash equilibria at a linear rate in Markov games?arXiv preprint arXiv:2303.03095, 2023

Zhuoqing Song, Jason D Lee, and Zhuoran Yang. Can we find Nash equilibria at a linear rate in Markov games?arXiv preprint arXiv:2303.03095, 2023

work page arXiv 2023

[58] [58]

Learning to predict by the methods of temporal differences.Machine Learning, 3(1):9–44, 1988

Richard S Sutton. Learning to predict by the methods of temporal differences.Machine Learning, 3(1):9–44, 1988

1988

[59] [59]

Dynamic control of autonomous quadrotor flight in an estimated wind field

Nitin Sydney, Brendan Smyth, and Derek A Paley. Dynamic control of autonomous quadrotor flight in an estimated wind field. In52nd IEEE Conference on Decision and Control, pages 3609–3616. IEEE, 2013

2013

[60] [60]

Scaling up robust MDPs using function approximation

Aviv Tamar, Shie Mannor, and Huan Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pages 181–189, 2014

2014

[61] [61]

Domain randomization for transferring deep neural networks from simulation to the real world

Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 23–30, 2017

2017

[62] [62]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012

2012

[63] [63]

Analysis of temporal-diffference learning with function approx- imation.Advances in neural information processing systems, 9, 1996

John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-diffference learning with function approx- imation.Advances in neural information processing systems, 9, 1996

1996

[64] [64]

Springer, 2008

Alexandre B Tsybakov.Introduction to Nonparametric Estimation. Springer, 2008

2008

[65] [65]

Mesures dans les espaces produits.Atti Accad

CT Ionescu Tulcea. Mesures dans les espaces produits.Atti Accad. Naz. Lincei Rend, 7:208–211, 1949

1949

[66] [66]

Discounted Markov games: Generalized policy iteration method.Journal of Opti- mization Theory and Applications, 25(1):125–138, 1978

Jan Van der Wal. Discounted Markov games: Generalized policy iteration method.Journal of Opti- mization Theory and Applications, 25(1):125–138, 1978

1978

[67] [67]

Cambridge Univer- sity Press, 2019

Martin J Wainwright.High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Univer- sity Press, 2019

2019

[68] [68]

Policy gradient in robust MDPs with global conver- gence guarantee

Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy gradient in robust MDPs with global conver- gence guarantee. InInternational Conference on Machine Learning, pages 35763–35797, 2023

2023

[69] [69]

A finite sample complexity bound for distributionally robustQ-learning

Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. A finite sample complexity bound for distributionally robustQ-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 3370–3398, 2023

2023

[70] [70]

Policy gradient method for robust reinforcement learning

Yue Wang and Shaofeng Zou. Policy gradient method for robust reinforcement learning. InInternational Conference on Machine Learning, pages 23484–23526, 2022

2022

[71] [71]

Robust Markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013

Wolfram Wiesemann, Daniel Kuhn, and Ber¸ c Rustem. Robust Markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013. 29

2013

[72] [72]

Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium

Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. InConference on Learning Theory, pages 3674–3682, 2020

2020

[73] [73]

Improved sample complexity bounds for distri- butionally robust reinforcement learning

Zaiyan Xu, Kishan Panaganti, and Dileep Kalathil. Improved sample complexity bounds for distri- butionally robust reinforcement learning. InInternational Conference on Artificial Intelligence and Statistics, pages 9728–9754, 2023

2023

[74] [74]

Model-based reinforcement learning for offline zero-sum Markov games.Operations Research, 72(6):2430–2445, 2024

Yuling Yan, Gen Li, Yuxin Chen, and Jianqing Fan. Model-based reinforcement learning for offline zero-sum Markov games.Operations Research, 72(6):2430–2445, 2024

2024

[75] [75]

Rates of convergence for empirical processes of stationary mixing sequences.The Annals of Probability, 22(1):94–116, 1994

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences.The Annals of Probability, 22(1):94–116, 1994

1994

[76] [76]

Regularized gradient descent ascent for two-player zero-sum Markov games

Sihan Zeng, Thinh Doan, and Justin Romberg. Regularized gradient descent ascent for two-player zero-sum Markov games. InNeural Information Processing Systems, pages 34546–34558, 2022

2022

[77] [77]

Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity.Journal of Machine Learning Research, 24(175):1–53, 2023

Kaiqing Zhang, Sham M Kakade, Tamer Basar, and Lin F Yang. Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity.Journal of Machine Learning Research, 24(175):1–53, 2023

2023

[78] [78]

Provably efficient policy optimization for two- player zero-sum Markov games

Yulai Zhao, Yuandong Tian, Jason Lee, and Simon Du. Provably efficient policy optimization for two- player zero-sum Markov games. InInternational Conference on Artificial Intelligence and Statistics, pages 2736–2761, 2022

2022

[79] [79]

Natural actor-critic for robust reinforcement learning with function approximation

Ruida Zhou, Tao Liu, Min Cheng, Dileep Kalathil, PR Kumar, and Chao Tian. Natural actor-critic for robust reinforcement learning with function approximation. InNeural Information Processing Systems, pages 97–133, 2023. A Appendix We introduce the following non-robust Bellman evaluation operator, namely for allv∈C b(S) ands∈ S, T P π v(s) = X a∈A π(a|s) r(...

2023