Robust Markov Decision Processes on Continuous State Spaces
Pith reviewed 2026-06-29 10:28 UTC · model grok-4.3
The pith
Stochastic first-order method achieves Õ(1/ε²) sample complexity for robust MDPs on continuous state spaces
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We utilize the dynamic formulation of the corresponding robust MDPs, and subsequently introduce a stochastic first-order method for robust policy evaluation. We establish its high probability convergence to the robust value function, which in turn leads to an Õ(1/ε²) sample complexity. This high probability accuracy certificate is then used in an approximate policy iteration method that finds an ε-optimal policy with Õ(1/ε²) samples. The obtained sample complexities for both robust policy evaluation and optimization appear to be new for robust MDPs with continuous state spaces. Of independent interest, the proposed method is also directly applicable to zero-sum Markov games, which seems to s
What carries the argument
Stochastic first-order method applied to the dynamic programming formulation of the robust MDP
If this is right
- Robust policy evaluation converges with high probability using Õ(1/ε²) samples.
- Approximate policy iteration identifies an ε-optimal policy with Õ(1/ε²) samples.
- The method applies to zero-sum Markov games and improves existing sample complexities for continuous state spaces.
Where Pith is reading between the lines
- The avoidance of discretization suggests the approach could scale directly to higher-dimensional robust control problems.
- Similar first-order techniques might transfer to robust optimization problems outside MDPs that share rectangular uncertainty structure.
- Empirical tests on benchmark continuous-state problems would check whether the theoretical sample bounds translate to observed performance.
Load-bearing premise
The rectangular ambiguity set lies inside the convex hull of unknown generating kernels so the dynamic program remains tractable for direct application of stochastic first-order methods without discretization.
What would settle it
A specific continuous-state robust MDP where robust policy evaluation requires more than Õ(1/ε²) samples to reach ε-accuracy would contradict the stated high-probability convergence.
read the original abstract
We study infinite-horizon robust Markov decision processes (MDPs) on continuous state spaces with structured rectangular ambiguity set. The proposed ambiguity set falls within the convex hull of unknown generating kernels. We utilize the dynamic formulation of the corresponding robust MDPs, and subsequently introduce a stochastic first-order method for robust policy evaluation. We establish its high probability convergence to the robust value function, which in turn leads to an $\widetilde{\mathcal O}(1/\epsilon^2)$ sample complexity. This high probability accuracy certificate is then used in an approximate policy iteration method that finds an $\epsilon$-optimal policy with $\widetilde{\mathcal O}(1/\epsilon^2)$ samples. The obtained sample complexities for both robust policy evaluation and optimization appear to be new for robust MDPs with continuous state spaces. Of independent interest, the proposed method is also directly applicable to zero-sum Markov games, which seems to strictly improve the existing sample complexities for continuous state spaces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies infinite-horizon robust MDPs on continuous state spaces under a structured rectangular ambiguity set contained in the convex hull of unknown generating kernels. It employs the dynamic programming formulation to develop a stochastic first-order method for robust policy evaluation that achieves high-probability convergence to the robust value function with ilde O(1/ε²) sample complexity. This certificate is then lifted to an approximate policy iteration procedure that yields an ε-optimal policy with the same sample complexity. The approach is also shown to apply directly to zero-sum Markov games, improving existing sample complexities for continuous-state settings.
Significance. If the stated high-probability bounds and sample complexities hold, the work would provide the first such guarantees for robust MDPs and games on continuous state spaces, extending beyond the discrete-state literature. The direct application of stochastic first-order methods to the continuous-state dynamic program without explicit discretization is a technical strength, and the rectangular ambiguity set construction enables the claimed tractability.
major comments (1)
- [Abstract and §3 (policy evaluation)] The central sample-complexity claims rest on the assumption that the rectangular ambiguity set lies inside the convex hull of the generating kernels and that the resulting robust Bellman operator remains amenable to stochastic first-order analysis without incurring approximation errors that would invalidate the ilde O(1/ε²) bound. This assumption is load-bearing for both the policy-evaluation and policy-iteration results; a concrete verification that the continuous-state technicalities (e.g., measurability and Lipschitz conditions on the kernels) are handled without hidden discretization is required in the proof of Theorem 3.1 or the corresponding section.
minor comments (2)
- Notation for the ambiguity set and the stochastic gradient estimator should be introduced with explicit dependence on the continuous state variable to avoid ambiguity when the method is applied to games.
- The high-probability statement would benefit from an explicit statement of the failure probability δ and how it enters the ilde O notation.
Simulated Author's Rebuttal
We thank the referee for their careful reading and for recognizing the potential significance of the first high-probability sample-complexity guarantees for continuous-state robust MDPs. We address the single major comment below and are prepared to revise the manuscript to strengthen the exposition of the technical conditions.
read point-by-point responses
-
Referee: [Abstract and §3 (policy evaluation)] The central sample-complexity claims rest on the assumption that the rectangular ambiguity set lies inside the convex hull of the generating kernels and that the resulting robust Bellman operator remains amenable to stochastic first-order analysis without incurring approximation errors that would invalidate the ilde O(1/ε²) bound. This assumption is load-bearing for both the policy-evaluation and policy-iteration results; a concrete verification that the continuous-state technicalities (e.g., measurability and Lipschitz conditions on the kernels) are handled without hidden discretization is required in the proof of Theorem 3.1 or the corresponding section.
Authors: We agree that explicit verification of these conditions is important for the continuous-state setting. In the current proof of Theorem 3.1 we start from the dynamic-programming formulation and apply the stochastic first-order method directly to the robust Bellman operator induced by the rectangular ambiguity set contained in the convex hull of the generating kernels. The proof invokes the standing measurability and Lipschitz assumptions stated in Section 2 on the kernels; these ensure that the robust operator remains a contraction mapping with the same modulus as the nominal case and that the stochastic gradients are unbiased with bounded variance, allowing the high-probability convergence analysis to proceed via standard martingale concentration without any discretization step. The same argument carries over to the approximate policy iteration result. To make the verification fully self-contained, we will add a short dedicated paragraph (or subsection) immediately after the statement of Theorem 3.1 that explicitly checks the required measurability, Lipschitz, and unbiased-gradient properties under the rectangular construction and confirms that no auxiliary discretization is introduced. revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper presents a stochastic first-order method for robust policy evaluation on continuous-state robust MDPs, deriving high-probability convergence and the resulting ilde O(1/ε²) sample complexity directly from the method's analysis applied to the dynamic programming formulation. This is then lifted to approximate policy iteration. No step reduces a claimed prediction or result to a fitted input, self-defined quantity, or load-bearing self-citation by construction; the bounds are obtained from standard stochastic optimization convergence arguments without circular reduction to the paper's own inputs or prior fitted constants. The extension to Markov games is stated as a direct applicability without invoking uniqueness theorems or ansatzes from overlapping prior work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The ambiguity set is structured rectangular and contained in the convex hull of unknown generating kernels.
- domain assumption The dynamic formulation of the robust MDP admits a stochastic first-order method whose iterates converge to the robust value function.
Forward citations
Cited by 1 Pith paper
-
Policy Gradient for Continuous-Time Robust Markov Decision Processes
Extends robust MDPs to continuous time with policy gradient derivations using differential equation methods and proposes optimizers achieving linear convergence and specific sample complexities.
Reference graph
Works this paper leans on
-
[1]
A natural actor-critic framework for zero-sum Markov games
Ahmet Alacaoglu, Luca Viano, Niao He, and Volkan Cevher. A natural actor-critic framework for zero-sum Markov games. InInternational Conference on Machine Learning, pages 307–366, 2022
2022
-
[2]
Springer, 2006
Charalambos D Aliprantis and Kim C Border.Infinite Dimensional Analysis: A Hitchhiker’s Guide. Springer, 2006
2006
-
[3]
A finite time analysis of temporal difference learning with linear function approximation
Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on Learning Theory, pages 1691–1692, 2018
2018
-
[4]
Multi-period trading via convex optimization.Foundations and Trends®in Optimization, 3(1):1–76, 2017
Stephen Boyd, Enzo Busseti, Steve Diamond, Ronald N Kahn, Kwangmoo Koh, Peter Nystrup, and Jan Speth. Multi-period trading via convex optimization.Foundations and Trends®in Optimization, 3(1):1–76, 2017
2017
-
[5]
Shicong Cen, Yuejie Chi, Simon S Du, and Lin Xiao. Faster last-iterate convergence of policy optimiza- tion in zero-sum Markov games.arXiv preprint arXiv:2210.01050, 2022
-
[6]
A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011
Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging.Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011
2011
-
[7]
A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games
Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games. InNeural Information Processing Systems, volume 36, pages 75826–75883, 2023
2023
-
[8]
Almost optimal algorithms for two-player zero-sum linear mixture Markov games
Zixiang Chen, Dongruo Zhou, and Quanquan Gu. Almost optimal algorithms for two-player zero-sum linear mixture Markov games. InInternational Conference on Algorithmic Learning Theory, pages 227–261, 2022
2022
-
[9]
Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016
Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016
2016
-
[10]
Cambridge Uni- versity Press, 2014
Giuseppe Da Prato and Jerzy Zabczyk.Stochastic Equations in Infinite Dimensions. Cambridge Uni- versity Press, 2014
2014
-
[11]
From low probability to high confi- dence in stochastic convex optimization.Journal of machine learning research, 22(49):1–38, 2021
Damek Davis, Dmitriy Drusvyatskiy, Lin Xiao, and Junyu Zhang. From low probability to high confi- dence in stochastic convex optimization.Journal of machine learning research, 22(49):1–38, 2021
2021
-
[12]
Robust Markov decision process: Beyond rectangularity
Vineet Goyal and Julien Grand-Clement. Robust Markov decision process: Beyond rectangularity. Mathematics of Operations Research, 48(1):203–226, 2022
2022
-
[13]
Partial policy iteration forl 1-robust Markov decision processes.Journal of Machine Learning Research, 22(1):12612–12657, 2021
Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Partial policy iteration forl 1-robust Markov decision processes.Journal of Machine Learning Research, 22(1):12612–12657, 2021
2021
-
[14]
Loss minimization and parameter estimation with heavy tails.Journal of Machine Learning Research, 17(18):1–40, 2016
Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails.Journal of Machine Learning Research, 17(18):1–40, 2016
2016
-
[15]
Does distributionally robust supervised learning give robust classifiers? InInternational Conference on Machine Learning, pages 2029–2037, 2018
Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers? InInternational Conference on Machine Learning, pages 2029–2037, 2018
2029
-
[16]
Baihe Huang, Jason D Lee, Zhaoran Wang, and Zhuoran Yang. Towards general function approximation in zero-sum Markov games.arXiv preprint arXiv:2107.14702, 2021. 26
-
[17]
Christian D Hubbs, Hector D Perez, Owais Sarwar, Nikolaos V Sahinidis, Ignacio E Grossmann, and John M Wassick. OR-Gym: A reinforcement learning library for operations research problems.arXiv preprint arXiv:2008.06319, 2020
-
[18]
Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005
Garud Iyengar. Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005
2005
-
[19]
The power of exploiter: Provable multi-agent RL in large state spaces
Chi Jin, Qinghua Liu, and Tiancheng Yu. The power of exploiter: Provable multi-agent RL in large state spaces. InInternational Conference on Machine Learning, pages 10251–10279, 2022
2022
-
[20]
Policy optimization over general state and action spaces.arXiv preprint arXiv:2211.16715,
Caleb Ju and Guanghui Lan. Policy optimization over general state and action spaces.arXiv preprint arXiv:2211.16715, 2022
-
[21]
Approximately optimal approximate reinforcement learning
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274, 2002
2002
-
[22]
Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and optimal methods for stochastic variational inequalities, ii: Markovian noise and policy evaluation in reinforcement learning.SIAM Journal on Optimization, 32(2):1120–1155, 2022
2022
-
[23]
Policy gradient for rectangular robust Markov decision processes
Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy gradient for rectangular robust Markov decision processes. InNeural Information Processing Systems, pages 59477–59501, 2023
2023
-
[24]
Navdeep Kumar, Adarsh Gupta, Maxence Mohamed Elfatihi, Giorgia Ramponi, Kfir Yehuda Levy, and Shie Mannor. Dual formulation for non-rectangularL p robust Markov decision processes.arXiv preprint arXiv:2502.09432, 2025
-
[25]
Efficient value iteration for s- rectangular robust Markov decision processes
Navdeep Kumar, Kaixin Wang, Kfir Yehuda Levy, and Shie Mannor. Efficient value iteration for s- rectangular robust Markov decision processes. InInternational Conference on Machine Learning, pages 25682–25725, 2024
2024
-
[26]
A novel catalyst scheme for stochastic minimax optimization.arXiv preprint arXiv:2311.02814, 2023
Guanghui Lan and Yan Li. A novel catalyst scheme for stochastic minimax optimization.arXiv preprint arXiv:2311.02814, 2023
-
[27]
PhD thesis, Massachusetts Institute of Technology, 2007
Yann Le Tallec.Robust, Risk-Sensitive, and Data-Driven Control of Markov Decision Processes. PhD thesis, Massachusetts Institute of Technology, 2007
2007
-
[28]
Minimax-optimal multi-agent RL in Markov games with a generative model
Gen Li, Yuejie Chi, Yuting Wei, and Yuxin Chen. Minimax-optimal multi-agent RL in Markov games with a generative model. InNeural Information Processing Systems, pages 15353–15367, 2022
2022
-
[29]
Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Policy gradient algorithms for robust MDPs with non-rectangular uncertainty sets.arXiv preprint arXiv:2305.19004, 2023
-
[30]
Towards optimal offline reinforcement learning.arXiv preprint arXiv:2503.12283, 2025
Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Towards optimal offline reinforcement learning.arXiv preprint arXiv:2503.12283, 2025
-
[31]
Sample Average Approximation for Distributionally Robust Optimization with $\phi$-divergences
Yan Li. Sample average approximation for distributionally robust optimization withϕ-divergences. arXiv preprint arXiv:2604.10855, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[32]
First-order policy optimization for robust policy evaluation.arXiv preprint arXiv:2307.15890, 2023
Yan Li and Guanghui Lan. First-order policy optimization for robust policy evaluation.arXiv preprint arXiv:2307.15890, 2023
-
[33]
Rectangularity and duality of distributionally robust Markov decision processes.Mathematical Programming, 2025
Yan Li and Alexander Shapiro. Rectangularity and duality of distributionally robust Markov decision processes.Mathematical Programming, 2025
2025
-
[34]
Yan Li, Tuo Zhao, and Guanghui Lan. First-order policy optimization for robust Markov decision process.arXiv preprint arXiv:2209.10579, 2022
-
[35]
Zhenghao Li, Shengbo Wang, and Nian Si. Near-optimal sample complexities of divergence-based s- rectangular distributionally robust reinforcement learning.arXiv preprint arXiv:2505.12202, 2025. 27
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Reinforcement learning for robust parameterized locomotion control of bipedal robots
Zhongyu Li, Xuxin Cheng, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, and Koushil Sreenath. Reinforcement learning for robust parameterized locomotion control of bipedal robots. In IEEE International Conference on Robotics and Automation, pages 2811–2817, 2021
2021
-
[37]
A single-loop robust policy gradient method for robust Markov decision processes
Zhenwei Lin, Chenyu Xue, Qi Deng, and Yinyu Ye. A single-loop robust policy gradient method for robust Markov decision processes. InInternational Conference on Machine Learning, pages 30392 – 30426, 2024
2024
-
[38]
Distributionally robustQ-learning
Zijian Liu, Qinxun Bai, Jose Blanchet, Perry Dong, Wei Xu, Zhengqing Zhou, and Zhengyuan Zhou. Distributionally robustQ-learning. InInternational Conference on Machine Learning, pages 13623– 13643, 2022
2022
-
[39]
Finite-time bounds for fitted value iteration.Journal of Machine Learning Research, 9(27):815–857, 2008
R´ emi Munos and Csaba Szepesv´ ari. Finite-time bounds for fitted value iteration.Journal of Machine Learning Research, 9(27):815–857, 2008
2008
-
[40]
Arkadi Nemirovski. Prox-method with rate of convergenceO(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems.SIAM Journal on Optimization, 15(1):229–251, 2004
2004
-
[41]
Problem complexity and method efficiency in optimization
Arkadij Semenoviˇ c Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983
1983
-
[42]
Smooth minimization of non-smooth functions.Mathematical Programming, 103(1):127– 152, 2005
Yu Nesterov. Smooth minimization of non-smooth functions.Mathematical Programming, 103(1):127– 152, 2005
2005
-
[43]
Primal-dual subgradient methods for convex problems.Mathematical programming, 120(1):221–259, 2009
Yurii Nesterov. Primal-dual subgradient methods for convex problems.Mathematical programming, 120(1):221–259, 2009
2009
-
[44]
Markov decision processes under model uncertainty
Ariel Neufeld, Julian Sester, and Mario ˇSiki´ c. Markov decision processes under model uncertainty. Mathematical Finance, 33(3):618–665, 2023
2023
-
[45]
Robust control of Markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005
Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005
2005
-
[46]
Sample complexity of robust reinforcement learning with a generative model
Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. InInternational Conference on Artificial Intelligence and Statistics, pages 9582– 9602, 2022
2022
-
[47]
Robust reinforcement learning using offline data
Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Robust reinforcement learning using offline data. InNeural Information Processing Systems, pages 32211–32224, 2022
2022
-
[48]
Patek.Stochastic Shortest Path Games: Theory and Algorithms
Stephen D. Patek.Stochastic Shortest Path Games: Theory and Algorithms. PhD thesis, Massachusetts Institute of Technology, 1997
1997
-
[49]
Sim-to-real transfer of robotic control with dynamics randomization
Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. InIEEE International Conference on Robotics and Automation, pages 3803–3810, 2018
2018
-
[50]
Approximate dynamic programming for two-player zero-sum Markov games
Julien Perolat, Bruno Scherrer, Bilal Piot, and Olivier Pietquin. Approximate dynamic programming for two-player zero-sum Markov games. InInternational Conference on Machine Learning, pages 1321– 1329, 2015
2015
-
[51]
Random features for large-scale kernel machines
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. InNeural Information Processing Systems, 2007
2007
-
[52]
Distribu- tionally robust model-based reinforcement learning with large state spaces
Shyam Sundhar Ramesh, Pier Giuseppe Sessa, Yifan Hu, Andreas Krause, and Ilija Bogunovic. Distribu- tionally robust model-based reinforcement learning with large state spaces. InInternational Conference on Artificial Intelligence and Statistics, pages 100–108, 2024. 28
2024
-
[53]
Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017
Alexander Shapiro. Distributionally robust stochastic programming.SIAM Journal on Optimization, 27(4):2258–2275, 2017
2017
-
[54]
The curious price of distributional robustness in reinforcement learning with a generative model
Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Matthieu Geist, and Yuejie Chi. The curious price of distributional robustness in reinforcement learning with a generative model. InNeural Information Processing Systems, pages 79903–79917, 2023
2023
-
[55]
Springer, 2009
Bruno Siciliano, Lorenzo Sciavicco, Luigi Villani, and Giuseppe Oriolo.Robotics: Modelling, Planning and Control. Springer, 2009
2009
-
[56]
Near-optimal time and sample complexities for solving Markov decision processes with a generative model
Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. InNeural Information Processing Systems, 2018
2018
-
[57]
Can we find Nash equilibria at a linear rate in Markov games?arXiv preprint arXiv:2303.03095, 2023
Zhuoqing Song, Jason D Lee, and Zhuoran Yang. Can we find Nash equilibria at a linear rate in Markov games?arXiv preprint arXiv:2303.03095, 2023
-
[58]
Learning to predict by the methods of temporal differences.Machine Learning, 3(1):9–44, 1988
Richard S Sutton. Learning to predict by the methods of temporal differences.Machine Learning, 3(1):9–44, 1988
1988
-
[59]
Dynamic control of autonomous quadrotor flight in an estimated wind field
Nitin Sydney, Brendan Smyth, and Derek A Paley. Dynamic control of autonomous quadrotor flight in an estimated wind field. In52nd IEEE Conference on Decision and Control, pages 3609–3616. IEEE, 2013
2013
-
[60]
Scaling up robust MDPs using function approximation
Aviv Tamar, Shie Mannor, and Huan Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pages 181–189, 2014
2014
-
[61]
Domain randomization for transferring deep neural networks from simulation to the real world
Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 23–30, 2017
2017
-
[62]
Mujoco: A physics engine for model-based control
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012
2012
-
[63]
Analysis of temporal-diffference learning with function approx- imation.Advances in neural information processing systems, 9, 1996
John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-diffference learning with function approx- imation.Advances in neural information processing systems, 9, 1996
1996
-
[64]
Springer, 2008
Alexandre B Tsybakov.Introduction to Nonparametric Estimation. Springer, 2008
2008
-
[65]
Mesures dans les espaces produits.Atti Accad
CT Ionescu Tulcea. Mesures dans les espaces produits.Atti Accad. Naz. Lincei Rend, 7:208–211, 1949
1949
-
[66]
Discounted Markov games: Generalized policy iteration method.Journal of Opti- mization Theory and Applications, 25(1):125–138, 1978
Jan Van der Wal. Discounted Markov games: Generalized policy iteration method.Journal of Opti- mization Theory and Applications, 25(1):125–138, 1978
1978
-
[67]
Cambridge Univer- sity Press, 2019
Martin J Wainwright.High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Univer- sity Press, 2019
2019
-
[68]
Policy gradient in robust MDPs with global conver- gence guarantee
Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy gradient in robust MDPs with global conver- gence guarantee. InInternational Conference on Machine Learning, pages 35763–35797, 2023
2023
-
[69]
A finite sample complexity bound for distributionally robustQ-learning
Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. A finite sample complexity bound for distributionally robustQ-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 3370–3398, 2023
2023
-
[70]
Policy gradient method for robust reinforcement learning
Yue Wang and Shaofeng Zou. Policy gradient method for robust reinforcement learning. InInternational Conference on Machine Learning, pages 23484–23526, 2022
2022
-
[71]
Robust Markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013
Wolfram Wiesemann, Daniel Kuhn, and Ber¸ c Rustem. Robust Markov decision processes.Mathematics of Operations Research, 38(1):153–183, 2013. 29
2013
-
[72]
Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium
Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. InConference on Learning Theory, pages 3674–3682, 2020
2020
-
[73]
Improved sample complexity bounds for distri- butionally robust reinforcement learning
Zaiyan Xu, Kishan Panaganti, and Dileep Kalathil. Improved sample complexity bounds for distri- butionally robust reinforcement learning. InInternational Conference on Artificial Intelligence and Statistics, pages 9728–9754, 2023
2023
-
[74]
Model-based reinforcement learning for offline zero-sum Markov games.Operations Research, 72(6):2430–2445, 2024
Yuling Yan, Gen Li, Yuxin Chen, and Jianqing Fan. Model-based reinforcement learning for offline zero-sum Markov games.Operations Research, 72(6):2430–2445, 2024
2024
-
[75]
Rates of convergence for empirical processes of stationary mixing sequences.The Annals of Probability, 22(1):94–116, 1994
Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences.The Annals of Probability, 22(1):94–116, 1994
1994
-
[76]
Regularized gradient descent ascent for two-player zero-sum Markov games
Sihan Zeng, Thinh Doan, and Justin Romberg. Regularized gradient descent ascent for two-player zero-sum Markov games. InNeural Information Processing Systems, pages 34546–34558, 2022
2022
-
[77]
Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity.Journal of Machine Learning Research, 24(175):1–53, 2023
Kaiqing Zhang, Sham M Kakade, Tamer Basar, and Lin F Yang. Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity.Journal of Machine Learning Research, 24(175):1–53, 2023
2023
-
[78]
Provably efficient policy optimization for two- player zero-sum Markov games
Yulai Zhao, Yuandong Tian, Jason Lee, and Simon Du. Provably efficient policy optimization for two- player zero-sum Markov games. InInternational Conference on Artificial Intelligence and Statistics, pages 2736–2761, 2022
2022
-
[79]
Natural actor-critic for robust reinforcement learning with function approximation
Ruida Zhou, Tao Liu, Min Cheng, Dileep Kalathil, PR Kumar, and Chao Tian. Natural actor-critic for robust reinforcement learning with function approximation. InNeural Information Processing Systems, pages 97–133, 2023. A Appendix We introduce the following non-robust Bellman evaluation operator, namely for allv∈C b(S) ands∈ S, T P π v(s) = X a∈A π(a|s) r(...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.