Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs
Pith reviewed 2026-05-23 21:27 UTC · model grok-4.3
The pith
PDR-ANPG reaches last-iterate ε optimality and constraint satisfaction for general parameterized policies in constrained MDPs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For parameterized policy classes with a transferred compatibility approximation error ε_bias, PDR-ANPG achieves a last-iterate ε optimality gap and ε constraint violation with a sample complexity of Õ(ε^{-2} min{ε^{-2}, ε_bias^{-1/3}}). If the class is incomplete (ε_bias > 0), the sample complexity reduces to Õ(ε^{-2}) for ε < (ε_bias)^{1/6}. For complete policies with ε_bias = 0, the algorithm achieves the same last-iterate guarantees with Õ(ε^{-4}) sample complexity. This improves prior last-iterate results for general parameterized CMDPs.
What carries the argument
The PDR-ANPG algorithm, which combines primal-dual updates with entropy and quadratic regularizers to produce last-iterate convergence under a transferred compatibility approximation error.
If this is right
- When ε_bias > 0 the sample complexity improves to Õ(ε^{-2}) once the target accuracy ε falls below ε_bias to the power 1/6.
- When the policy class is complete and ε_bias = 0 the algorithm requires only Õ(ε^{-4}) samples to reach last-iterate ε optimality and feasibility.
- All guarantees are stated for last-iterate rather than averaged iterates.
- The bounds constitute an improvement over existing last-iterate sample-complexity results for general parameterized CMDPs.
Where Pith is reading between the lines
- In practice one could deliberately choose a policy class whose ε_bias is moderate and then stop at an accuracy level matched to that bias to obtain the cheaper Õ(ε^{-2}) regime.
- The explicit dependence on ε_bias offers a way to trade off policy-class expressivity against sample cost in applied constrained RL problems.
- If similar transferred compatibility errors can be defined for other constrained sequential decision problems, the same primal-dual regularized analysis may yield last-iterate rates there as well.
Load-bearing premise
The transferred compatibility approximation error ε_bias is a well-defined, bounded quantity for the chosen policy class that can be controlled independently of the optimization process.
What would settle it
A direct computation or simulation on a simple CMDP that measures whether the last-iterate optimality gap and constraint violation decrease at the stated rates when ε_bias is known and fixed.
read the original abstract
This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entropy and quadratic regularizers to reach this goal. For parameterized policy classes with a transferred compatibility approximation error, $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. It is a significant improvement over the state-of-the-art last-iterate guarantees of general parameterized CMDPs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm for solving constrained MDPs with general parameterized policies. It introduces a transferred compatibility approximation error ε_bias and claims last-iterate ε-optimality and ε-constraint violation with sample complexity Õ(ε^{-2} min{ε^{-2}, ε_bias^{-1/3}}), which simplifies to Õ(ε^{-2}) when ε < (ε_bias)^{1/6} for incomplete classes (ε_bias > 0) and Õ(ε^{-4}) for complete classes (ε_bias = 0). These are presented as improvements over prior last-iterate results for parameterized CMDPs.
Significance. If the central claims hold, the work would advance last-iterate analysis for constrained RL with function approximation, providing explicit rates that degrade gracefully with policy-class incompleteness. The explicit dependence on ε_bias and the regime distinctions are potentially useful for practitioners selecting policy classes, though the improvement is incremental relative to existing primal-dual analyses in CMDPs.
major comments (2)
- [Abstract and definition of ε_bias] The headline rates in the abstract rest on treating ε_bias as a static, a-priori bounded constant decoupled from the primal-dual iterates. The transferred compatibility condition used to define ε_bias must be shown (in the section introducing the error term and in the convergence analysis) to remain independent of the current policy parameters and dual variables; otherwise the min{ε^{-2}, ε_bias^{-1/3}} and the ε < (ε_bias)^{1/6} regime do not follow.
- [Main theorem / convergence analysis] The sample-complexity statements claim last-iterate guarantees, yet the proof sketch (or main theorem) must explicitly reduce the bias term to the stated rates without additional trajectory-dependent factors. If the analysis invokes standard RL assumptions plus ε_bias without a self-contained bound on how ε_bias enters the last-iterate Lyapunov function, the Õ(ε^{-4}) claim for ε_bias = 0 and the improved incomplete-class rate are not yet load-bearing.
minor comments (2)
- [Preliminaries] Notation for the regularizers (entropy and quadratic) and the transferred compatibility condition should be introduced with explicit equations before the main theorem.
- [Abstract / Related work] The abstract states 'significant improvement over the state-of-the-art'; a table comparing the new rates to the cited prior last-iterate bounds would clarify the precise improvement.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. Below we respond point by point to the major comments, clarifying the independence of ε_bias and the structure of the convergence analysis. We will revise the manuscript to make these aspects more explicit.
read point-by-point responses
-
Referee: [Abstract and definition of ε_bias] The headline rates in the abstract rest on treating ε_bias as a static, a-priori bounded constant decoupled from the primal-dual iterates. The transferred compatibility condition used to define ε_bias must be shown (in the section introducing the error term and in the convergence analysis) to remain independent of the current policy parameters and dual variables; otherwise the min{ε^{-2}, ε_bias^{-1/3}} and the ε < (ε_bias)^{1/6} regime do not follow.
Authors: We agree that an explicit statement is warranted. In the manuscript, ε_bias is introduced in Section 3 (Definition 3.2) as the worst-case transferred compatibility error over the entire policy class and all dual variables in [0, λ_max]; it is therefore a fixed property of the policy class and does not depend on the current primal-dual iterates. The convergence analysis in the appendix already uses this uniform bound. To address the concern directly, we will add a short lemma in the revised main text (new Lemma 3.3) that formally proves independence from the iterates, thereby justifying the stated rates and the regime distinction. revision: yes
-
Referee: [Main theorem / convergence analysis] The sample-complexity statements claim last-iterate guarantees, yet the proof sketch (or main theorem) must explicitly reduce the bias term to the stated rates without additional trajectory-dependent factors. If the analysis invokes standard RL assumptions plus ε_bias without a self-contained bound on how ε_bias enters the last-iterate Lyapunov function, the Õ(ε^{-4}) claim for ε_bias = 0 and the improved incomplete-class rate are not yet load-bearing.
Authors: The proof of the main result (Theorem 4.1) and its supporting lemmas in Appendix B construct a last-iterate Lyapunov function that incorporates the bias term as an additive constant bounded by C·ε_bias (where C is independent of the trajectory length). The analysis then balances the regularization, primal-dual, and bias terms to obtain the claimed rates without introducing extra trajectory-dependent factors beyond the standard bounded-variance and smoothness assumptions already stated. We will insert a concise paragraph immediately after the statement of Theorem 4.1 that summarizes this reduction step, making the dependence on ε_bias fully explicit in the main text. revision: partial
Circularity Check
No circularity: ε_bias treated as exogenous class property
full rationale
The paper defines ε_bias as a transferred compatibility approximation error of the parameterized policy class, presented as a static, a priori bounded quantity decoupled from the primal-dual iterates. Sample-complexity claims (Õ(ε^{-2} min{ε^{-2}, ε_bias^{-1/3}}) and reductions for ε < (ε_bias)^{1/6}) are stated under this assumption without any quoted reduction of the rates to fitted quantities, self-citations, or redefinitions that would make the result tautological. The derivation chain therefore remains self-contained against the stated assumptions and does not exhibit any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The underlying CMDP satisfies the usual technical conditions (finite state-action spaces, proper discounting, bounded costs) required for policy-gradient analysis.
- domain assumption The transferred compatibility approximation error ε_bias is a fixed, policy-class-dependent constant that can be bounded independently of the learning process.
Forward citations
Cited by 1 Pith paper
-
Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs
An inexact augmented Lagrangian method with projected Q-ascent yields global last-iterate convergence guarantees for constrained MDP policy optimization, extending from tabular to log-linear and non-linear policies.
Reference graph
Works this paper leans on
-
[1]
On the theory of policy gradient methods: Optimality , approximation, and distribution shi ft
Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Maha jan. On the theory of policy gradient methods: Optimality , approximation, and distribution shi ft. The Journal of Machine Learning Research , 22(1):4431–4506, 2021
work page 2021
-
[2]
Achieving zero constraint violation for constrained reinforcement learn ing via primal-dual approach
Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel , and V aneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learn ing via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 36, pages 3682–3689, 2022
work page 2022
-
[3]
Qinbo Bai, Amrit Singh Bedi, and V aneet Aggarwal. Achiev ing zero constraint violation for con- strained reinforcement learning via conservative natural policy gradient primal-dual algorithm. In Proceedings of the AAAI Conference on Artificial Intelligen ce, volume 37, pages 6737–6744, 2023
work page 2023
-
[4]
Qinbo Bai, W ashim Uddin Mondal, and V aneet Aggarwal. Regret analysis of policy gradient algorithm for infinite horizon average reward markov decision process es. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10980–10988, 2024
work page 2024
-
[5]
On the linear convergen ce of policy gradient methods for finite mdps
Jalaj Bhandari and Daniel Russo. On the linear convergen ce of policy gradient methods for finite mdps. In International Conference on Artificial Intelligence and St atistics, pages 2386–2394. PMLR, 2021
work page 2021
-
[6]
Fast global convergence of natural policy gradient methods with entropy regularization
Shicong Cen, Chen Cheng, Yuxin Chen, Yuting W ei, and Yuejie Chi. Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578, 2022
work page 2022
-
[7]
Zaiwei Chen and Siva Theja Maguluri. Sample complexity o f policy-based methods under off-policy sampling and linear function approximation. In International Conference on Artificial Intelligence and Statistics, pages 11195–11214. PMLR, 2022
work page 2022
-
[8]
Last-iterate convergent policy gradient primal-dual methods for constrained mdps
Dongsheng Ding, Chen-Yu W ei, Kaiqing Zhang, and Alejand ro Ribeiro. Last-iterate convergent policy gradient primal-dual methods for constrained mdps. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[9]
Provably effi- cient safe exploration via primal-dual policy optimizatio n
Dongsheng Ding, Xiaohan W ei, Zhuoran Y ang, Zhaoran W ang, and Mihailo Jovanovic. Provably effi- cient safe exploration via primal-dual policy optimizatio n. In International conference on artificial intelli- gence and statistics , pages 3304–3312. PMLR, 2021. 11
work page 2021
-
[10]
Natural policy gradient primal- dual method for constrained markov decision processes
Dongsheng Ding, Kaiqing Zhang, T amer Basar, and Mihailo Jovanovic. Natural policy gradient primal- dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33:8378–8390, 2020
work page 2020
-
[11]
Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020
Y onathan Efroni, Shie Mannor, and Matteo Pirotta. Expl oration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189 , 2020
-
[12]
Stochastic policy gradient methods: Improved sample complexity for fisher-non-degenerate poli cies
Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, an d Niao He. Stochastic policy gradient methods: Improved sample complexity for fisher-non-degenerate poli cies. In International Conference on Machine Learning, pages 9827–9869. PMLR, 2023
work page 2023
-
[13]
Matilde Gargiani, Andrea Zanelli, Andrea Martinelli, Tyler Summers, and John Lygeros. Page-pg: A simple and loopless variance-reduced policy gradient meth od with probabilistic gradient estimation. In International Conference on Machine Learning , pages 7223–7240. PMLR, 2022
work page 2022
-
[14]
Algorithm for constrained markov decisio n process with linear convergence
Egor Gladin, Maksim Lavrik-Karmazin, Karina Zainulli na, V arvara Rudenko, Alexander Gasnikov , and Martin T akac. Algorithm for constrained markov decisio n process with linear convergence. In International Conference on Artificial Intelligence and St atistics, pages 11506–11533. PMLR, 2023
work page 2023
-
[15]
Nearly minima x optimal reinforcement learning for discounted mdps
Jiafan He, Dongruo Zhou, and Quanquan Gu. Nearly minima x optimal reinforcement learning for discounted mdps. Advances in Neural Information Processing Systems , 34:22288–22300, 2021
work page 2021
-
[16]
Mo mentum-based policy gradient methods
Feihu Huang, Shangqian Gao, Jian Pei, and Heng Huang. Mo mentum-based policy gradient methods. In International conference on machine learning , pages 4422–4433. PMLR, 2020
work page 2020
-
[17]
Accelerating stochastic gradient descent for least squares regression
Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Ne trapalli, and Aaron Sidford. Accelerating stochastic gradient descent for least squares regression. In Conference On Learning Theory , pages 545–
-
[18]
Provably efficient reinforcement learning with linear function approximation
Chi Jin, Zhuoran Y ang, Zhaoran W ang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research , pages 2137–2143. PMLR, 09–12 Jul 2020
work page 2020
-
[19]
Finite-sample analysis of two-time-scale natural actor–critic algorithm
Sajad Khodadadian, Thinh T Doan, Justin Romberg, and Si va Theja Maguluri. Finite-sample analysis of two-time-scale natural actor–critic algorithm. IEEE T ransactions on Automatic Control , 68(6):3273– 3284, 2022
work page 2022
-
[20]
Guanghui Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity , and generalized problem classes. Mathematical programming, 198(1):1059–1106, 2023
work page 2023
-
[21]
Learning policies with zero or bounded constraint violation for constrained mdps
T ao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kuma r, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183–17193, 2021
work page 2021
-
[22]
Policy optimization for constrained mdps with provable fast global convergence
T ao Liu, Ruida Zhou, Dileep Kalathil, PR Kumar, and Chao Tian. Policy optimization for constrained mdps with provable fast global convergence. arXiv preprint arXiv:2111.00552 , 2021
-
[23]
An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods
Y anli Liu, Kaiqing Zhang, T amer Basar, and W otao Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33:7624–7636, 2020
work page 2020
-
[24]
Saeed Masiha, Saber Salehkaleybar, Niao He, Negar Kiya vash, and Patrick Thiran. Stochastic second- order methods improve best-known sample complexity of sgd f or gradient-dominated functions. Ad- vances in Neural Information Processing Systems , 35:10862–10875, 2022
work page 2022
-
[25]
W ashim Uddin Mondal and V aneet Aggarwal. Improved samp le complexity analysis of natural pol- icy gradient algorithm with general parameterization for i nfinite horizon discounted reward Markov decision processes. In International Conference on Artificial Intelligence and St atistics, pages 3097–3105. PMLR, 2024
work page 2024
-
[26]
Sample-effici ent constrained reinforcement learning with general parameterization
W ashim Uddin Mondal and V aneet Aggarwal. Sample-effici ent constrained reinforcement learning with general parameterization. arXiv preprint arXiv:2405.10624 , 2024. 12
-
[27]
Last-iterate global convergence of policy gradients for constrained rei nforcement learning
Alessandro Montenegro, Marco Mussi, Matteo Papini, an d Alberto Maria Metelli. Last-iterate global convergence of policy gradients for constrained rei nforcement learning. arXiv preprint arXiv:2407.10775, 2024
-
[28]
Zebang Shen, Alejandro Ribeiro, Hamed Hassani, Hui Qia n, and Chao Mi. Hessian aided policy gradient. In International conference on machine learning , pages 5729–5738. PMLR, 2019
work page 2019
-
[29]
Policy gradient methods for reinforcement learning with function approximation
Richard S Sutton, David McAllester, Satinder Singh, an d Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems , 12, 1999
work page 1999
-
[30]
Near- optimal sample complexity bounds for con- strained mdps
Sharan V aswani, Lin Y ang, and Csaba Szepesv´ ari. Near- optimal sample complexity bounds for con- strained mdps. Advances in Neural Information Processing Systems , 35:3110–3122, 2022
work page 2022
-
[31]
N eural policy gradient methods: Global optimality and rates of convergence, 2019
Lingxiao W ang, Qi Cai, Zhuoran Y ang, and Zhaoran W ang. N eural policy gradient methods: Global optimality and rates of convergence, 2019
work page 2019
-
[32]
Honghao W ei, Xin Liu, and Lei Ying. Triple-q: A model-fr ee algorithm for constrained reinforcement learning with sublinear regret and zero constraint violati on. In International Conference on Artificial Intelligence and Statistics , pages 3274–3307. PMLR, 2022
work page 2022
-
[33]
An improved conver gence analysis of stochastic variance- reduced policy gradient
Pan Xu, Felicia Gao, and Quanquan Gu. An improved conver gence analysis of stochastic variance- reduced policy gradient. In Uncertainty in Artificial Intelligence , pages 541–551. PMLR, 2020
work page 2020
-
[34]
Crpo: A new a pproach for safe reinforcement learning with convergence guarantee
T engyu Xu, Yingbin Liang, and Guanghui Lan. Crpo: A new a pproach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning , pages 11480–11491, 2021
work page 2021
-
[35]
Sample-optimal parametric q- learning using linearly additive features
Lin Y ang and Mengdi W ang. Sample-optimal parametric q- learning using linearly additive features. In International conference on machine learning , pages 6995–7004. PMLR, 2019
work page 2019
-
[36]
A dual appro ach to constrained markov decision processes with entropy regularization
Donghao Ying, Yuhao Ding, and Javad Lavaei. A dual appro ach to constrained markov decision processes with entropy regularization. In International Conference on Artificial Intelligence and St atistics, pages 1887–1909. PMLR, 2022
work page 1909
-
[37]
Sihan Zeng, Thinh T Doan, and Justin Romberg. Finite-ti me complexity of online primal-dual natu- ral actor-critic algorithm for constrained markov decisio n processes. In 2022 IEEE 61st Conference on Decision and Control (CDC) , pages 4028–4033. IEEE, 2022
work page 2022
-
[38]
W enhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jaso n D Lee, and Yuejie Chi. Policy mirror descent for regularized reinforcement learning: A general ized framework with linear convergence. SIAM Journal on Optimization , 33(2):1061–1091, 2023
work page 2023
-
[39]
On the convergence and sam- ple efficiency of variance-reduced policy gradient method
Junyu Zhang, Chengzhuo Ni, Csaba Szepesvari, Mengdi W a ng, et al. On the convergence and sam- ple efficiency of variance-reduced policy gradient method. Advances in Neural Information Processing Systems, 34:2228–2240, 2021
work page 2021
-
[40]
On the convergence and sample efficiency of variance-reduced policy gradient m ethod
Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, and Mengdi W ang. On the convergence and sample efficiency of variance-reduced policy gradient m ethod. Advances in Neural Information Processing Systems, 34:2228–2240, 2021. 13 A Proof of Lemma 2 Fix a tuple (θ,λ,τ ). The following proof assumes that the policy and all relevan t value functions are di...
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.