pith. sign in

arxiv: 2605.28706 · v1 · pith:S36N4V3Knew · submitted 2026-05-27 · 🧮 math.OC

Robust Markov Decision Processes on Continuous State Spaces

Pith reviewed 2026-06-29 10:28 UTC · model grok-4.3

classification 🧮 math.OC
keywords robust MDPscontinuous state spacesstochastic first-order methodpolicy evaluationpolicy iterationsample complexityambiguity setsMarkov games
0
0 comments X

The pith

Stochastic first-order method achieves Õ(1/ε²) sample complexity for robust MDPs on continuous state spaces

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce a stochastic first-order method for robust policy evaluation in infinite-horizon robust MDPs on continuous state spaces. The method uses a structured rectangular ambiguity set contained in the convex hull of unknown kernels. It establishes high probability convergence to the robust value function, resulting in an Õ(1/ε²) sample complexity. This is then applied in approximate policy iteration to find an ε-optimal policy with the same complexity. The results are new for continuous state spaces and extend to zero-sum Markov games.

Core claim

We utilize the dynamic formulation of the corresponding robust MDPs, and subsequently introduce a stochastic first-order method for robust policy evaluation. We establish its high probability convergence to the robust value function, which in turn leads to an Õ(1/ε²) sample complexity. This high probability accuracy certificate is then used in an approximate policy iteration method that finds an ε-optimal policy with Õ(1/ε²) samples. The obtained sample complexities for both robust policy evaluation and optimization appear to be new for robust MDPs with continuous state spaces. Of independent interest, the proposed method is also directly applicable to zero-sum Markov games, which seems to s

What carries the argument

Stochastic first-order method applied to the dynamic programming formulation of the robust MDP

If this is right

  • Robust policy evaluation converges with high probability using Õ(1/ε²) samples.
  • Approximate policy iteration identifies an ε-optimal policy with Õ(1/ε²) samples.
  • The method applies to zero-sum Markov games and improves existing sample complexities for continuous state spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The avoidance of discretization suggests the approach could scale directly to higher-dimensional robust control problems.
  • Similar first-order techniques might transfer to robust optimization problems outside MDPs that share rectangular uncertainty structure.
  • Empirical tests on benchmark continuous-state problems would check whether the theoretical sample bounds translate to observed performance.

Load-bearing premise

The rectangular ambiguity set lies inside the convex hull of unknown generating kernels so the dynamic program remains tractable for direct application of stochastic first-order methods without discretization.

What would settle it

A specific continuous-state robust MDP where robust policy evaluation requires more than Õ(1/ε²) samples to reach ε-accuracy would contradict the stated high-probability convergence.

read the original abstract

We study infinite-horizon robust Markov decision processes (MDPs) on continuous state spaces with structured rectangular ambiguity set. The proposed ambiguity set falls within the convex hull of unknown generating kernels. We utilize the dynamic formulation of the corresponding robust MDPs, and subsequently introduce a stochastic first-order method for robust policy evaluation. We establish its high probability convergence to the robust value function, which in turn leads to an $\widetilde{\mathcal O}(1/\epsilon^2)$ sample complexity. This high probability accuracy certificate is then used in an approximate policy iteration method that finds an $\epsilon$-optimal policy with $\widetilde{\mathcal O}(1/\epsilon^2)$ samples. The obtained sample complexities for both robust policy evaluation and optimization appear to be new for robust MDPs with continuous state spaces. Of independent interest, the proposed method is also directly applicable to zero-sum Markov games, which seems to strictly improve the existing sample complexities for continuous state spaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper studies infinite-horizon robust MDPs on continuous state spaces under a structured rectangular ambiguity set contained in the convex hull of unknown generating kernels. It employs the dynamic programming formulation to develop a stochastic first-order method for robust policy evaluation that achieves high-probability convergence to the robust value function with ilde O(1/ε²) sample complexity. This certificate is then lifted to an approximate policy iteration procedure that yields an ε-optimal policy with the same sample complexity. The approach is also shown to apply directly to zero-sum Markov games, improving existing sample complexities for continuous-state settings.

Significance. If the stated high-probability bounds and sample complexities hold, the work would provide the first such guarantees for robust MDPs and games on continuous state spaces, extending beyond the discrete-state literature. The direct application of stochastic first-order methods to the continuous-state dynamic program without explicit discretization is a technical strength, and the rectangular ambiguity set construction enables the claimed tractability.

major comments (1)
  1. [Abstract and §3 (policy evaluation)] The central sample-complexity claims rest on the assumption that the rectangular ambiguity set lies inside the convex hull of the generating kernels and that the resulting robust Bellman operator remains amenable to stochastic first-order analysis without incurring approximation errors that would invalidate the ilde O(1/ε²) bound. This assumption is load-bearing for both the policy-evaluation and policy-iteration results; a concrete verification that the continuous-state technicalities (e.g., measurability and Lipschitz conditions on the kernels) are handled without hidden discretization is required in the proof of Theorem 3.1 or the corresponding section.
minor comments (2)
  1. Notation for the ambiguity set and the stochastic gradient estimator should be introduced with explicit dependence on the continuous state variable to avoid ambiguity when the method is applied to games.
  2. The high-probability statement would benefit from an explicit statement of the failure probability δ and how it enters the ilde O notation.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and for recognizing the potential significance of the first high-probability sample-complexity guarantees for continuous-state robust MDPs. We address the single major comment below and are prepared to revise the manuscript to strengthen the exposition of the technical conditions.

read point-by-point responses
  1. Referee: [Abstract and §3 (policy evaluation)] The central sample-complexity claims rest on the assumption that the rectangular ambiguity set lies inside the convex hull of the generating kernels and that the resulting robust Bellman operator remains amenable to stochastic first-order analysis without incurring approximation errors that would invalidate the ilde O(1/ε²) bound. This assumption is load-bearing for both the policy-evaluation and policy-iteration results; a concrete verification that the continuous-state technicalities (e.g., measurability and Lipschitz conditions on the kernels) are handled without hidden discretization is required in the proof of Theorem 3.1 or the corresponding section.

    Authors: We agree that explicit verification of these conditions is important for the continuous-state setting. In the current proof of Theorem 3.1 we start from the dynamic-programming formulation and apply the stochastic first-order method directly to the robust Bellman operator induced by the rectangular ambiguity set contained in the convex hull of the generating kernels. The proof invokes the standing measurability and Lipschitz assumptions stated in Section 2 on the kernels; these ensure that the robust operator remains a contraction mapping with the same modulus as the nominal case and that the stochastic gradients are unbiased with bounded variance, allowing the high-probability convergence analysis to proceed via standard martingale concentration without any discretization step. The same argument carries over to the approximate policy iteration result. To make the verification fully self-contained, we will add a short dedicated paragraph (or subsection) immediately after the statement of Theorem 3.1 that explicitly checks the required measurability, Lipschitz, and unbiased-gradient properties under the rectangular construction and confirms that no auxiliary discretization is introduced. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents a stochastic first-order method for robust policy evaluation on continuous-state robust MDPs, deriving high-probability convergence and the resulting ilde O(1/ε²) sample complexity directly from the method's analysis applied to the dynamic programming formulation. This is then lifted to approximate policy iteration. No step reduces a claimed prediction or result to a fitted input, self-defined quantity, or load-bearing self-citation by construction; the bounds are obtained from standard stochastic optimization convergence arguments without circular reduction to the paper's own inputs or prior fitted constants. The extension to Markov games is stated as a direct applicability without invoking uniqueness theorems or ansatzes from overlapping prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the rectangular structure of the ambiguity set, its representation as the convex hull of unknown kernels, and the validity of the dynamic programming formulation for robust value functions on continuous spaces.

axioms (2)
  • domain assumption The ambiguity set is structured rectangular and contained in the convex hull of unknown generating kernels.
    Stated directly in the abstract as the modeling choice for the robust MDP.
  • domain assumption The dynamic formulation of the robust MDP admits a stochastic first-order method whose iterates converge to the robust value function.
    Invoked when the paper 'utilizes the dynamic formulation' to introduce the stochastic method.

pith-pipeline@v0.9.1-grok · 5689 in / 1569 out tokens · 30134 ms · 2026-06-29T10:28:36.011812+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Policy Gradient for Continuous-Time Robust Markov Decision Processes

    cs.LG 2026-06 unverdicted novelty 7.0

    Extends robust MDPs to continuous time with policy gradient derivations using differential equation methods and proposes optimizers achieving linear convergence and specific sample complexities.

Reference graph

Works this paper leans on

79 extracted references · 13 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    A natural actor-critic framework for zero-sum Markov games

    Ahmet Alacaoglu, Luca Viano, Niao He, and Volkan Cevher. A natural actor-critic framework for zero-sum Markov games. InInternational Conference on Machine Learning, pages 307–366, 2022

  2. [2]

    Springer, 2006

    Charalambos D Aliprantis and Kim C Border.Infinite Dimensional Analysis: A Hitchhiker’s Guide. Springer, 2006

  3. [3]

    A finite time analysis of temporal difference learning with linear function approximation

    Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on Learning Theory, pages 1691–1692, 2018

  4. [4]

    Multi-period trading via convex optimization.Foundations and Trends®in Optimization, 3(1):1–76, 2017

    Stephen Boyd, Enzo Busseti, Steve Diamond, Ronald N Kahn, Kwangmoo Koh, Peter Nystrup, and Jan Speth. Multi-period trading via convex optimization.Foundations and Trends®in Optimization, 3(1):1–76, 2017

  5. [5]

    Faster last-iterate convergence of policy optimiza- tion in zero-sum Markov games.arXiv preprint arXiv:2210.01050, 2022

    Shicong Cen, Yuejie Chi, Simon S Du, and Lin Xiao. Faster last-iterate convergence of policy optimiza- tion in zero-sum Markov games.arXiv preprint arXiv:2210.01050, 2022

  6. [6]

    A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011

    Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011

  7. [7]

    A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games

    Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games. InNeural Information Processing Systems, volume 36, pages 75826–75883, 2023

  8. [8]

    Almost optimal algorithms for two-player zero-sum linear mixture Markov games

    Zixiang Chen, Dongruo Zhou, and Quanquan Gu. Almost optimal algorithms for two-player zero-sum linear mixture Markov games. InInternational Conference on Algorithmic Learning Theory, pages 227–261, 2022

  9. [9]

    Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016

    Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016

  10. [10]

    Cambridge Uni- versity Press, 2014

    Giuseppe Da Prato and Jerzy Zabczyk.Stochastic Equations in Infinite Dimensions. Cambridge Uni- versity Press, 2014

  11. [11]

    From low probability to high confi- dence in stochastic convex optimization.Journal of machine learning research, 22(49):1–38, 2021

    Damek Davis, Dmitriy Drusvyatskiy, Lin Xiao, and Junyu Zhang. From low probability to high confi- dence in stochastic convex optimization.Journal of machine learning research, 22(49):1–38, 2021

  12. [12]

    Robust Markov decision process: Beyond rectangularity

    Vineet Goyal and Julien Grand-Clement. Robust Markov decision process: Beyond rectangularity. Mathematics of Operations Research, 48(1):203–226, 2022

  13. [13]

    Partial policy iteration forl 1-robust Markov decision processes.Journal of Machine Learning Research, 22(1):12612–12657, 2021

    Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Partial policy iteration forl 1-robust Markov decision processes.Journal of Machine Learning Research, 22(1):12612–12657, 2021

  14. [14]

    Loss minimization and parameter estimation with heavy tails.Journal of Machine Learning Research, 17(18):1–40, 2016

    Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails.Journal of Machine Learning Research, 17(18):1–40, 2016

  15. [15]

    Does distributionally robust supervised learning give robust classifiers? InInternational Conference on Machine Learning, pages 2029–2037, 2018

    Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers? InInternational Conference on Machine Learning, pages 2029–2037, 2018

  16. [16]

    Towards general function approximation in zero-sum Markov games.arXiv preprint arXiv:2107.14702, 2021

    Baihe Huang, Jason D Lee, Zhaoran Wang, and Zhuoran Yang. Towards general function approximation in zero-sum Markov games.arXiv preprint arXiv:2107.14702, 2021. 26

  17. [17]

    OR-Gym: A reinforcement learning library for operations research problems.arXiv preprint arXiv:2008.06319, 2020

    Christian D Hubbs, Hector D Perez, Owais Sarwar, Nikolaos V Sahinidis, Ignacio E Grossmann, and John M Wassick. OR-Gym: A reinforcement learning library for operations research problems.arXiv preprint arXiv:2008.06319, 2020

  18. [18]

    Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

    Garud Iyengar. Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

  19. [19]

    The power of exploiter: Provable multi-agent RL in large state spaces

    Chi Jin, Qinghua Liu, and Tiancheng Yu. The power of exploiter: Provable multi-agent RL in large state spaces. InInternational Conference on Machine Learning, pages 10251–10279, 2022

  20. [20]

    Policy optimization over general state and action spaces.arXiv preprint arXiv:2211.16715,

    Caleb Ju and Guanghui Lan. Policy optimization over general state and action spaces.arXiv preprint arXiv:2211.16715, 2022

  21. [21]

    Approximately optimal approximate reinforcement learning

    Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274, 2002

  22. [22]

    Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and optimal methods for stochastic variational inequalities, ii: Markovian noise and policy evaluation in reinforcement learning.SIAM Journal on Optimization, 32(2):1120–1155, 2022

  23. [23]

    Policy gradient for rectangular robust Markov decision processes

    Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy gradient for rectangular robust Markov decision processes. InNeural Information Processing Systems, pages 59477–59501, 2023

  24. [24]

    Dual formulation for non-rectangularL p robust Markov decision processes.arXiv preprint arXiv:2502.09432, 2025

    Navdeep Kumar, Adarsh Gupta, Maxence Mohamed Elfatihi, Giorgia Ramponi, Kfir Yehuda Levy, and Shie Mannor. Dual formulation for non-rectangularL p robust Markov decision processes.arXiv preprint arXiv:2502.09432, 2025

  25. [25]

    Efficient value iteration for s- rectangular robust Markov decision processes

    Navdeep Kumar, Kaixin Wang, Kfir Yehuda Levy, and Shie Mannor. Efficient value iteration for s- rectangular robust Markov decision processes. InInternational Conference on Machine Learning, pages 25682–25725, 2024

  26. [26]

    A novel catalyst scheme for stochastic minimax optimization.arXiv preprint arXiv:2311.02814, 2023

    Guanghui Lan and Yan Li. A novel catalyst scheme for stochastic minimax optimization.arXiv preprint arXiv:2311.02814, 2023

  27. [27]

    PhD thesis, Massachusetts Institute of Technology, 2007

    Yann Le Tallec.Robust, Risk-Sensitive, and Data-Driven Control of Markov Decision Processes. PhD thesis, Massachusetts Institute of Technology, 2007

  28. [28]

    Minimax-optimal multi-agent RL in Markov games with a generative model

    Gen Li, Yuejie Chi, Yuting Wei, and Yuxin Chen. Minimax-optimal multi-agent RL in Markov games with a generative model. InNeural Information Processing Systems, pages 15353–15367, 2022

  29. [29]

    Policy gradient algorithms for robust MDPs with non-rectangular uncertainty sets.arXiv preprint arXiv:2305.19004, 2023

    Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Policy gradient algorithms for robust MDPs with non-rectangular uncertainty sets.arXiv preprint arXiv:2305.19004, 2023

  30. [30]

    Towards optimal offline reinforcement learning.arXiv preprint arXiv:2503.12283, 2025

    Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Towards optimal offline reinforcement learning.arXiv preprint arXiv:2503.12283, 2025

  31. [31]

    Sample Average Approximation for Distributionally Robust Optimization with $\phi$-divergences

    Yan Li. Sample average approximation for distributionally robust optimization withϕ-divergences. arXiv preprint arXiv:2604.10855, 2026

  32. [32]

    First-order policy optimization for robust policy evaluation.arXiv preprint arXiv:2307.15890, 2023

    Yan Li and Guanghui Lan. First-order policy optimization for robust policy evaluation.arXiv preprint arXiv:2307.15890, 2023

  33. [33]

    Rectangularity and duality of distributionally robust Markov decision processes.Mathematical Programming, 2025

    Yan Li and Alexander Shapiro. Rectangularity and duality of distributionally robust Markov decision processes.Mathematical Programming, 2025

  34. [34]

    First-order policy optimization for robust Markov decision process.arXiv preprint arXiv:2209.10579, 2022

    Yan Li, Tuo Zhao, and Guanghui Lan. First-order policy optimization for robust Markov decision process.arXiv preprint arXiv:2209.10579, 2022

  35. [35]

    Near-Optimal Sample Complexities of Divergence-based S-rectangular Distributionally Robust Reinforcement Learning

    Zhenghao Li, Shengbo Wang, and Nian Si. Near-optimal sample complexities of divergence-based s- rectangular distributionally robust reinforcement learning.arXiv preprint arXiv:2505.12202, 2025. 27

  36. [36]

    Reinforcement learning for robust parameterized locomotion control of bipedal robots

    Zhongyu Li, Xuxin Cheng, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, and Koushil Sreenath. Reinforcement learning for robust parameterized locomotion control of bipedal robots. In IEEE International Conference on Robotics and Automation, pages 2811–2817, 2021

  37. [37]

    A single-loop robust policy gradient method for robust Markov decision processes

    Zhenwei Lin, Chenyu Xue, Qi Deng, and Yinyu Ye. A single-loop robust policy gradient method for robust Markov decision processes. InInternational Conference on Machine Learning, pages 30392 – 30426, 2024

  38. [38]

    Distributionally robustQ-learning

    Zijian Liu, Qinxun Bai, Jose Blanchet, Perry Dong, Wei Xu, Zhengqing Zhou, and Zhengyuan Zhou. Distributionally robustQ-learning. InInternational Conference on Machine Learning, pages 13623– 13643, 2022

  39. [39]

    Finite-time bounds for fitted value iteration.Journal of Machine Learning Research, 9(27):815–857, 2008

    R´ emi Munos and Csaba Szepesv´ ari. Finite-time bounds for fitted value iteration.Journal of Machine Learning Research, 9(27):815–857, 2008

  40. [40]

    Arkadi Nemirovski. Prox-method with rate of convergenceO(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems.SIAM Journal on Optimization, 15(1):229–251, 2004

  41. [41]

    Problem complexity and method efficiency in optimization

    Arkadij Semenoviˇ c Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983

  42. [42]

    Smooth minimization of non-smooth functions.Mathematical Programming, 103(1):127– 152, 2005

    Yu Nesterov. Smooth minimization of non-smooth functions.Mathematical Programming, 103(1):127– 152, 2005

  43. [43]

    Primal-dual subgradient methods for convex problems.Mathematical programming, 120(1):221–259, 2009

    Yurii Nesterov. Primal-dual subgradient methods for convex problems.Mathematical programming, 120(1):221–259, 2009

  44. [44]

    Markov decision processes under model uncertainty

    Ariel Neufeld, Julian Sester, and Mario ˇSiki´ c. Markov decision processes under model uncertainty. Mathematical Finance, 33(3):618–665, 2023

  45. [45]

    Robust control of Markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

    Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

  46. [46]

    Sample complexity of robust reinforcement learning with a generative model

    Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. InInternational Conference on Artificial Intelligence and Statistics, pages 9582– 9602, 2022

  47. [47]

    Robust reinforcement learning using offline data

    Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Robust reinforcement learning using offline data. InNeural Information Processing Systems, pages 32211–32224, 2022

  48. [48]

    Patek.Stochastic Shortest Path Games: Theory and Algorithms

    Stephen D. Patek.Stochastic Shortest Path Games: Theory and Algorithms. PhD thesis, Massachusetts Institute of Technology, 1997

  49. [49]

    Sim-to-real transfer of robotic control with dynamics randomization

    Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. InIEEE International Conference on Robotics and Automation, pages 3803–3810, 2018

  50. [50]

    Approximate dynamic programming for two-player zero-sum Markov games

    Julien Perolat, Bruno Scherrer, Bilal Piot, and Olivier Pietquin. Approximate dynamic programming for two-player zero-sum Markov games. InInternational Conference on Machine Learning, pages 1321– 1329, 2015

  51. [51]

    Random features for large-scale kernel machines

    Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. InNeural Information Processing Systems, 2007

  52. [52]

    Distribu- tionally robust model-based reinforcement learning with large state spaces

    Shyam Sundhar Ramesh, Pier Giuseppe Sessa, Yifan Hu, Andreas Krause, and Ilija Bogunovic. Distribu- tionally robust model-based reinforcement learning with large state spaces. InInternational Conference on Artificial Intelligence and Statistics, pages 100–108, 2024. 28

  53. [53]

    Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017

    Alexander Shapiro. Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017

  54. [54]

    The curious price of distributional robustness in reinforcement learning with a generative model

    Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Matthieu Geist, and Yuejie Chi. The curious price of distributional robustness in reinforcement learning with a generative model. InNeural Information Processing Systems, pages 79903–79917, 2023

  55. [55]

    Springer, 2009

    Bruno Siciliano, Lorenzo Sciavicco, Luigi Villani, and Giuseppe Oriolo.Robotics: Modelling, Planning and Control. Springer, 2009

  56. [56]

    Near-optimal time and sample complexities for solving Markov decision processes with a generative model

    Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. InNeural Information Processing Systems, 2018

  57. [57]

    Can we find Nash equilibria at a linear rate in Markov games?arXiv preprint arXiv:2303.03095, 2023

    Zhuoqing Song, Jason D Lee, and Zhuoran Yang. Can we find Nash equilibria at a linear rate in Markov games?arXiv preprint arXiv:2303.03095, 2023

  58. [58]

    Learning to predict by the methods of temporal differences.Machine Learning, 3(1):9–44, 1988

    Richard S Sutton. Learning to predict by the methods of temporal differences.Machine Learning, 3(1):9–44, 1988

  59. [59]

    Dynamic control of autonomous quadrotor flight in an estimated wind field

    Nitin Sydney, Brendan Smyth, and Derek A Paley. Dynamic control of autonomous quadrotor flight in an estimated wind field. In52nd IEEE Conference on Decision and Control, pages 3609–3616. IEEE, 2013

  60. [60]

    Scaling up robust MDPs using function approximation

    Aviv Tamar, Shie Mannor, and Huan Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pages 181–189, 2014

  61. [61]

    Domain randomization for transferring deep neural networks from simulation to the real world

    Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 23–30, 2017

  62. [62]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012

  63. [63]

    Analysis of temporal-diffference learning with function approx- imation.Advances in neural information processing systems, 9, 1996

    John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-diffference learning with function approx- imation.Advances in neural information processing systems, 9, 1996

  64. [64]

    Springer, 2008

    Alexandre B Tsybakov.Introduction to Nonparametric Estimation. Springer, 2008

  65. [65]

    Mesures dans les espaces produits.Atti Accad

    CT Ionescu Tulcea. Mesures dans les espaces produits.Atti Accad. Naz. Lincei Rend, 7:208–211, 1949

  66. [66]

    Discounted Markov games: Generalized policy iteration method.Journal of Opti- mization Theory and Applications, 25(1):125–138, 1978

    Jan Van der Wal. Discounted Markov games: Generalized policy iteration method.Journal of Opti- mization Theory and Applications, 25(1):125–138, 1978

  67. [67]

    Cambridge Univer- sity Press, 2019

    Martin J Wainwright.High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Univer- sity Press, 2019

  68. [68]

    Policy gradient in robust MDPs with global conver- gence guarantee

    Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy gradient in robust MDPs with global conver- gence guarantee. InInternational Conference on Machine Learning, pages 35763–35797, 2023

  69. [69]

    A finite sample complexity bound for distributionally robustQ-learning

    Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. A finite sample complexity bound for distributionally robustQ-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 3370–3398, 2023

  70. [70]

    Policy gradient method for robust reinforcement learning

    Yue Wang and Shaofeng Zou. Policy gradient method for robust reinforcement learning. InInternational Conference on Machine Learning, pages 23484–23526, 2022

  71. [71]

    Robust Markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013

    Wolfram Wiesemann, Daniel Kuhn, and Ber¸ c Rustem. Robust Markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013. 29

  72. [72]

    Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium

    Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. InConference on Learning Theory, pages 3674–3682, 2020

  73. [73]

    Improved sample complexity bounds for distri- butionally robust reinforcement learning

    Zaiyan Xu, Kishan Panaganti, and Dileep Kalathil. Improved sample complexity bounds for distri- butionally robust reinforcement learning. InInternational Conference on Artificial Intelligence and Statistics, pages 9728–9754, 2023

  74. [74]

    Model-based reinforcement learning for offline zero-sum Markov games.Operations Research, 72(6):2430–2445, 2024

    Yuling Yan, Gen Li, Yuxin Chen, and Jianqing Fan. Model-based reinforcement learning for offline zero-sum Markov games.Operations Research, 72(6):2430–2445, 2024

  75. [75]

    Rates of convergence for empirical processes of stationary mixing sequences.The Annals of Probability, 22(1):94–116, 1994

    Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences.The Annals of Probability, 22(1):94–116, 1994

  76. [76]

    Regularized gradient descent ascent for two-player zero-sum Markov games

    Sihan Zeng, Thinh Doan, and Justin Romberg. Regularized gradient descent ascent for two-player zero-sum Markov games. InNeural Information Processing Systems, pages 34546–34558, 2022

  77. [77]

    Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity.Journal of Machine Learning Research, 24(175):1–53, 2023

    Kaiqing Zhang, Sham M Kakade, Tamer Basar, and Lin F Yang. Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity.Journal of Machine Learning Research, 24(175):1–53, 2023

  78. [78]

    Provably efficient policy optimization for two- player zero-sum Markov games

    Yulai Zhao, Yuandong Tian, Jason Lee, and Simon Du. Provably efficient policy optimization for two- player zero-sum Markov games. InInternational Conference on Artificial Intelligence and Statistics, pages 2736–2761, 2022

  79. [79]

    Natural actor-critic for robust reinforcement learning with function approximation

    Ruida Zhou, Tao Liu, Min Cheng, Dileep Kalathil, PR Kumar, and Chao Tian. Natural actor-critic for robust reinforcement learning with function approximation. InNeural Information Processing Systems, pages 97–133, 2023. A Appendix We introduce the following non-robust Bellman evaluation operator, namely for allv∈C b(S) ands∈ S, T P π v(s) = X a∈A π(a|s) r(...