Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Vaneet Aggarwal; Washim Uddin Mondal

arxiv: 2408.11513 · v2 · submitted 2024-08-21 · 💻 cs.LG · cs.AI

Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Washim Uddin Mondal , Vaneet Aggarwal This is my paper

Pith reviewed 2026-05-23 21:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords constrained MDPslast-iterate convergenceparameterized policiesnatural policy gradientprimal-dual methodssample complexityregularized algorithmspolicy optimization

0 comments

The pith

PDR-ANPG reaches last-iterate ε optimality and constraint satisfaction for general parameterized policies in constrained MDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the PDR-ANPG algorithm, a primal-dual regularized accelerated natural policy gradient method, to solve constrained Markov decision processes when policies come from a general parameterized class. It proves last-iterate convergence guarantees that depend on a transferred compatibility approximation error ε_bias of the policy class. These rates matter because they guarantee that the policy returned at the end of training is nearly optimal and feasible, rather than only an average over the run, and they improve when the class is incomplete provided the target accuracy is not too fine. The results cover both biased and unbiased policy classes with explicit sample-complexity bounds.

Core claim

For parameterized policy classes with a transferred compatibility approximation error ε_bias, PDR-ANPG achieves a last-iterate ε optimality gap and ε constraint violation with a sample complexity of Õ(ε^{-2} min{ε^{-2}, ε_bias^{-1/3}}). If the class is incomplete (ε_bias > 0), the sample complexity reduces to Õ(ε^{-2}) for ε < (ε_bias)^{1/6}. For complete policies with ε_bias = 0, the algorithm achieves the same last-iterate guarantees with Õ(ε^{-4}) sample complexity. This improves prior last-iterate results for general parameterized CMDPs.

What carries the argument

The PDR-ANPG algorithm, which combines primal-dual updates with entropy and quadratic regularizers to produce last-iterate convergence under a transferred compatibility approximation error.

If this is right

When ε_bias > 0 the sample complexity improves to Õ(ε^{-2}) once the target accuracy ε falls below ε_bias to the power 1/6.
When the policy class is complete and ε_bias = 0 the algorithm requires only Õ(ε^{-4}) samples to reach last-iterate ε optimality and feasibility.
All guarantees are stated for last-iterate rather than averaged iterates.
The bounds constitute an improvement over existing last-iterate sample-complexity results for general parameterized CMDPs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

In practice one could deliberately choose a policy class whose ε_bias is moderate and then stop at an accuracy level matched to that bias to obtain the cheaper Õ(ε^{-2}) regime.
The explicit dependence on ε_bias offers a way to trade off policy-class expressivity against sample cost in applied constrained RL problems.
If similar transferred compatibility errors can be defined for other constrained sequential decision problems, the same primal-dual regularized analysis may yield last-iterate rates there as well.

Load-bearing premise

The transferred compatibility approximation error ε_bias is a well-defined, bounded quantity for the chosen policy class that can be controlled independently of the optimization process.

What would settle it

A direct computation or simulation on a simple CMDP that measures whether the last-iterate optimality gap and constraint violation decrease at the stated rates when ε_bias is known and fixed.

read the original abstract

This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entropy and quadratic regularizers to reach this goal. For parameterized policy classes with a transferred compatibility approximation error, $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. It is a significant improvement over the state-of-the-art last-iterate guarantees of general parameterized CMDPs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives last-iterate rates for general parameterized CMDPs that improve on prior work via an explicit ε_bias term, but the rates stand or fall on whether that term stays fixed and independent of the iterates.

read the letter

The main takeaway is that this work extends last-iterate convergence to general parameterized policies in constrained MDPs and supplies sample-complexity bounds that depend on a transferred compatibility approximation error ε_bias. The PDR-ANPG algorithm adds entropy and quadratic regularizers to a primal-dual natural policy gradient update. For classes with ε_bias > 0 the claimed rate is Õ(ε^{-2} min{ε^{-2}, ε_bias^{-1/3}}), which simplifies to Õ(ε^{-2}) once ε drops below ε_bias^{1/6}. When the class is complete and ε_bias = 0 the rate becomes Õ(ε^{-4}). These expressions are not in the earlier last-iterate CMDP literature the abstract cites, so the dependence on ε_bias and the specific exponents count as new.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm for solving constrained MDPs with general parameterized policies. It introduces a transferred compatibility approximation error ε_bias and claims last-iterate ε-optimality and ε-constraint violation with sample complexity Õ(ε^{-2} min{ε^{-2}, ε_bias^{-1/3}}), which simplifies to Õ(ε^{-2}) when ε < (ε_bias)^{1/6} for incomplete classes (ε_bias > 0) and Õ(ε^{-4}) for complete classes (ε_bias = 0). These are presented as improvements over prior last-iterate results for parameterized CMDPs.

Significance. If the central claims hold, the work would advance last-iterate analysis for constrained RL with function approximation, providing explicit rates that degrade gracefully with policy-class incompleteness. The explicit dependence on ε_bias and the regime distinctions are potentially useful for practitioners selecting policy classes, though the improvement is incremental relative to existing primal-dual analyses in CMDPs.

major comments (2)

[Abstract and definition of ε_bias] The headline rates in the abstract rest on treating ε_bias as a static, a-priori bounded constant decoupled from the primal-dual iterates. The transferred compatibility condition used to define ε_bias must be shown (in the section introducing the error term and in the convergence analysis) to remain independent of the current policy parameters and dual variables; otherwise the min{ε^{-2}, ε_bias^{-1/3}} and the ε < (ε_bias)^{1/6} regime do not follow.
[Main theorem / convergence analysis] The sample-complexity statements claim last-iterate guarantees, yet the proof sketch (or main theorem) must explicitly reduce the bias term to the stated rates without additional trajectory-dependent factors. If the analysis invokes standard RL assumptions plus ε_bias without a self-contained bound on how ε_bias enters the last-iterate Lyapunov function, the Õ(ε^{-4}) claim for ε_bias = 0 and the improved incomplete-class rate are not yet load-bearing.

minor comments (2)

[Preliminaries] Notation for the regularizers (entropy and quadratic) and the transferred compatibility condition should be introduced with explicit equations before the main theorem.
[Abstract / Related work] The abstract states 'significant improvement over the state-of-the-art'; a table comparing the new rates to the cited prior last-iterate bounds would clarify the precise improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. Below we respond point by point to the major comments, clarifying the independence of ε_bias and the structure of the convergence analysis. We will revise the manuscript to make these aspects more explicit.

read point-by-point responses

Referee: [Abstract and definition of ε_bias] The headline rates in the abstract rest on treating ε_bias as a static, a-priori bounded constant decoupled from the primal-dual iterates. The transferred compatibility condition used to define ε_bias must be shown (in the section introducing the error term and in the convergence analysis) to remain independent of the current policy parameters and dual variables; otherwise the min{ε^{-2}, ε_bias^{-1/3}} and the ε < (ε_bias)^{1/6} regime do not follow.

Authors: We agree that an explicit statement is warranted. In the manuscript, ε_bias is introduced in Section 3 (Definition 3.2) as the worst-case transferred compatibility error over the entire policy class and all dual variables in [0, λ_max]; it is therefore a fixed property of the policy class and does not depend on the current primal-dual iterates. The convergence analysis in the appendix already uses this uniform bound. To address the concern directly, we will add a short lemma in the revised main text (new Lemma 3.3) that formally proves independence from the iterates, thereby justifying the stated rates and the regime distinction. revision: yes
Referee: [Main theorem / convergence analysis] The sample-complexity statements claim last-iterate guarantees, yet the proof sketch (or main theorem) must explicitly reduce the bias term to the stated rates without additional trajectory-dependent factors. If the analysis invokes standard RL assumptions plus ε_bias without a self-contained bound on how ε_bias enters the last-iterate Lyapunov function, the Õ(ε^{-4}) claim for ε_bias = 0 and the improved incomplete-class rate are not yet load-bearing.

Authors: The proof of the main result (Theorem 4.1) and its supporting lemmas in Appendix B construct a last-iterate Lyapunov function that incorporates the bias term as an additive constant bounded by C·ε_bias (where C is independent of the trajectory length). The analysis then balances the regularization, primal-dual, and bias terms to obtain the claimed rates without introducing extra trajectory-dependent factors beyond the standard bounded-variance and smoothness assumptions already stated. We will insert a concise paragraph immediately after the statement of Theorem 4.1 that summarizes this reduction step, making the dependence on ε_bias fully explicit in the main text. revision: partial

Circularity Check

0 steps flagged

No circularity: ε_bias treated as exogenous class property

full rationale

The paper defines ε_bias as a transferred compatibility approximation error of the parameterized policy class, presented as a static, a priori bounded quantity decoupled from the primal-dual iterates. Sample-complexity claims (Õ(ε^{-2} min{ε^{-2}, ε_bias^{-1/3}}) and reductions for ε < (ε_bias)^{1/6}) are stated under this assumption without any quoted reduction of the rates to fitted quantities, self-citations, or redefinitions that would make the result tautological. The derivation chain therefore remains self-contained against the stated assumptions and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Analysis relies on standard CMDP assumptions (finite spaces, discounting) and the definition of transferred compatibility error; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption The underlying CMDP satisfies the usual technical conditions (finite state-action spaces, proper discounting, bounded costs) required for policy-gradient analysis.
Invoked implicitly to obtain any sample-complexity bound in CMDP literature.
domain assumption The transferred compatibility approximation error ε_bias is a fixed, policy-class-dependent constant that can be bounded independently of the learning process.
Central to all stated rates; appears in the abstract as the key quantity modulating complexity.

pith-pipeline@v0.9.0 · 5751 in / 1431 out tokens · 24151 ms · 2026-05-23T21:27:08.657122+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs
cs.LG 2026-05 unverdicted novelty 6.0

An inexact augmented Lagrangian method with projected Q-ascent yields global last-iterate convergence guarantees for constrained MDP policy optimization, extending from tabular to log-linear and non-linear policies.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 1 Pith paper

[1]

On the theory of policy gradient methods: Optimality , approximation, and distribution shi ft

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Maha jan. On the theory of policy gradient methods: Optimality , approximation, and distribution shi ft. The Journal of Machine Learning Research , 22(1):4431–4506, 2021

work page 2021
[2]

Achieving zero constraint violation for constrained reinforcement learn ing via primal-dual approach

Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel , and V aneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learn ing via primal-dual approach. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence , volume 36, pages 3682–3689, 2022

work page 2022
[3]

Achiev ing zero constraint violation for con- strained reinforcement learning via conservative natural policy gradient primal-dual algorithm

Qinbo Bai, Amrit Singh Bedi, and V aneet Aggarwal. Achiev ing zero constraint violation for con- strained reinforcement learning via conservative natural policy gradient primal-dual algorithm. In Proceedings of the AAAI Conference on Artiﬁcial Intelligen ce, volume 37, pages 6737–6744, 2023

work page 2023
[4]

Regret analysis of policy gradient algorithm for inﬁnite horizon average reward markov decision process es

Qinbo Bai, W ashim Uddin Mondal, and V aneet Aggarwal. Regret analysis of policy gradient algorithm for inﬁnite horizon average reward markov decision process es. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 38, pages 10980–10988, 2024

work page 2024
[5]

On the linear convergen ce of policy gradient methods for ﬁnite mdps

Jalaj Bhandari and Daniel Russo. On the linear convergen ce of policy gradient methods for ﬁnite mdps. In International Conference on Artiﬁcial Intelligence and St atistics, pages 2386–2394. PMLR, 2021

work page 2021
[6]

Fast global convergence of natural policy gradient methods with entropy regularization

Shicong Cen, Chen Cheng, Yuxin Chen, Yuting W ei, and Yuejie Chi. Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578, 2022

work page 2022
[7]

Sample complexity o f policy-based methods under off-policy sampling and linear function approximation

Zaiwei Chen and Siva Theja Maguluri. Sample complexity o f policy-based methods under off-policy sampling and linear function approximation. In International Conference on Artiﬁcial Intelligence and Statistics, pages 11195–11214. PMLR, 2022

work page 2022
[8]

Last-iterate convergent policy gradient primal-dual methods for constrained mdps

Dongsheng Ding, Chen-Yu W ei, Kaiqing Zhang, and Alejand ro Ribeiro. Last-iterate convergent policy gradient primal-dual methods for constrained mdps. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[9]

Provably efﬁ- cient safe exploration via primal-dual policy optimizatio n

Dongsheng Ding, Xiaohan W ei, Zhuoran Y ang, Zhaoran W ang, and Mihailo Jovanovic. Provably efﬁ- cient safe exploration via primal-dual policy optimizatio n. In International conference on artiﬁcial intelli- gence and statistics , pages 3304–3312. PMLR, 2021. 11

work page 2021
[10]

Natural policy gradient primal- dual method for constrained markov decision processes

Dongsheng Ding, Kaiqing Zhang, T amer Basar, and Mihailo Jovanovic. Natural policy gradient primal- dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33:8378–8390, 2020

work page 2020
[11]

Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

Y onathan Efroni, Shie Mannor, and Matteo Pirotta. Expl oration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189 , 2020

work page arXiv 2003
[12]

Stochastic policy gradient methods: Improved sample complexity for ﬁsher-non-degenerate poli cies

Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, an d Niao He. Stochastic policy gradient methods: Improved sample complexity for ﬁsher-non-degenerate poli cies. In International Conference on Machine Learning, pages 9827–9869. PMLR, 2023

work page 2023
[13]

Page-pg: A simple and loopless variance-reduced policy gradient meth od with probabilistic gradient estimation

Matilde Gargiani, Andrea Zanelli, Andrea Martinelli, Tyler Summers, and John Lygeros. Page-pg: A simple and loopless variance-reduced policy gradient meth od with probabilistic gradient estimation. In International Conference on Machine Learning , pages 7223–7240. PMLR, 2022

work page 2022
[14]

Algorithm for constrained markov decisio n process with linear convergence

Egor Gladin, Maksim Lavrik-Karmazin, Karina Zainulli na, V arvara Rudenko, Alexander Gasnikov , and Martin T akac. Algorithm for constrained markov decisio n process with linear convergence. In International Conference on Artiﬁcial Intelligence and St atistics, pages 11506–11533. PMLR, 2023

work page 2023
[15]

Nearly minima x optimal reinforcement learning for discounted mdps

Jiafan He, Dongruo Zhou, and Quanquan Gu. Nearly minima x optimal reinforcement learning for discounted mdps. Advances in Neural Information Processing Systems , 34:22288–22300, 2021

work page 2021
[16]

Mo mentum-based policy gradient methods

Feihu Huang, Shangqian Gao, Jian Pei, and Heng Huang. Mo mentum-based policy gradient methods. In International conference on machine learning , pages 4422–4433. PMLR, 2020

work page 2020
[17]

Accelerating stochastic gradient descent for least squares regression

Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Ne trapalli, and Aaron Sidford. Accelerating stochastic gradient descent for least squares regression. In Conference On Learning Theory , pages 545–

work page
[18]

Provably efﬁcient reinforcement learning with linear function approximation

Chi Jin, Zhuoran Y ang, Zhaoran W ang, and Michael I Jordan. Provably efﬁcient reinforcement learning with linear function approximation. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research , pages 2137–2143. PMLR, 09–12 Jul 2020

work page 2020
[19]

Finite-sample analysis of two-time-scale natural actor–critic algorithm

Sajad Khodadadian, Thinh T Doan, Justin Romberg, and Si va Theja Maguluri. Finite-sample analysis of two-time-scale natural actor–critic algorithm. IEEE T ransactions on Automatic Control , 68(6):3273– 3284, 2022

work page 2022
[20]

Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity , and generalized problem classes

Guanghui Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity , and generalized problem classes. Mathematical programming, 198(1):1059–1106, 2023

work page 2023
[21]

Learning policies with zero or bounded constraint violation for constrained mdps

T ao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kuma r, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183–17193, 2021

work page 2021
[22]

Policy optimization for constrained mdps with provable fast global convergence

T ao Liu, Ruida Zhou, Dileep Kalathil, PR Kumar, and Chao Tian. Policy optimization for constrained mdps with provable fast global convergence. arXiv preprint arXiv:2111.00552 , 2021

work page arXiv 2021
[23]

An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods

Y anli Liu, Kaiqing Zhang, T amer Basar, and W otao Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33:7624–7636, 2020

work page 2020
[24]

Stochastic second- order methods improve best-known sample complexity of sgd f or gradient-dominated functions

Saeed Masiha, Saber Salehkaleybar, Niao He, Negar Kiya vash, and Patrick Thiran. Stochastic second- order methods improve best-known sample complexity of sgd f or gradient-dominated functions. Ad- vances in Neural Information Processing Systems , 35:10862–10875, 2022

work page 2022
[25]

Improved samp le complexity analysis of natural pol- icy gradient algorithm with general parameterization for i nﬁnite horizon discounted reward Markov decision processes

W ashim Uddin Mondal and V aneet Aggarwal. Improved samp le complexity analysis of natural pol- icy gradient algorithm with general parameterization for i nﬁnite horizon discounted reward Markov decision processes. In International Conference on Artiﬁcial Intelligence and St atistics, pages 3097–3105. PMLR, 2024

work page 2024
[26]

Sample-efﬁci ent constrained reinforcement learning with general parameterization

W ashim Uddin Mondal and V aneet Aggarwal. Sample-efﬁci ent constrained reinforcement learning with general parameterization. arXiv preprint arXiv:2405.10624 , 2024. 12

work page arXiv 2024
[27]

Last-iterate global convergence of policy gradients for constrained rei nforcement learning

Alessandro Montenegro, Marco Mussi, Matteo Papini, an d Alberto Maria Metelli. Last-iterate global convergence of policy gradients for constrained rei nforcement learning. arXiv preprint arXiv:2407.10775, 2024

work page arXiv 2024
[28]

Hessian aided policy gradient

Zebang Shen, Alejandro Ribeiro, Hamed Hassani, Hui Qia n, and Chao Mi. Hessian aided policy gradient. In International conference on machine learning , pages 5729–5738. PMLR, 2019

work page 2019
[29]

Policy gradient methods for reinforcement learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, an d Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems , 12, 1999

work page 1999
[30]

Near- optimal sample complexity bounds for con- strained mdps

Sharan V aswani, Lin Y ang, and Csaba Szepesv´ ari. Near- optimal sample complexity bounds for con- strained mdps. Advances in Neural Information Processing Systems , 35:3110–3122, 2022

work page 2022
[31]

N eural policy gradient methods: Global optimality and rates of convergence, 2019

Lingxiao W ang, Qi Cai, Zhuoran Y ang, and Zhaoran W ang. N eural policy gradient methods: Global optimality and rates of convergence, 2019

work page 2019
[32]

Triple-q: A model-fr ee algorithm for constrained reinforcement learning with sublinear regret and zero constraint violati on

Honghao W ei, Xin Liu, and Lei Ying. Triple-q: A model-fr ee algorithm for constrained reinforcement learning with sublinear regret and zero constraint violati on. In International Conference on Artiﬁcial Intelligence and Statistics , pages 3274–3307. PMLR, 2022

work page 2022
[33]

An improved conver gence analysis of stochastic variance- reduced policy gradient

Pan Xu, Felicia Gao, and Quanquan Gu. An improved conver gence analysis of stochastic variance- reduced policy gradient. In Uncertainty in Artiﬁcial Intelligence , pages 541–551. PMLR, 2020

work page 2020
[34]

Crpo: A new a pproach for safe reinforcement learning with convergence guarantee

T engyu Xu, Yingbin Liang, and Guanghui Lan. Crpo: A new a pproach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning , pages 11480–11491, 2021

work page 2021
[35]

Sample-optimal parametric q- learning using linearly additive features

Lin Y ang and Mengdi W ang. Sample-optimal parametric q- learning using linearly additive features. In International conference on machine learning , pages 6995–7004. PMLR, 2019

work page 2019
[36]

A dual appro ach to constrained markov decision processes with entropy regularization

Donghao Ying, Yuhao Ding, and Javad Lavaei. A dual appro ach to constrained markov decision processes with entropy regularization. In International Conference on Artiﬁcial Intelligence and St atistics, pages 1887–1909. PMLR, 2022

work page 1909
[37]

Finite-ti me complexity of online primal-dual natu- ral actor-critic algorithm for constrained markov decisio n processes

Sihan Zeng, Thinh T Doan, and Justin Romberg. Finite-ti me complexity of online primal-dual natu- ral actor-critic algorithm for constrained markov decisio n processes. In 2022 IEEE 61st Conference on Decision and Control (CDC) , pages 4028–4033. IEEE, 2022

work page 2022
[38]

Policy mirror descent for regularized reinforcement learning: A general ized framework with linear convergence

W enhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jaso n D Lee, and Yuejie Chi. Policy mirror descent for regularized reinforcement learning: A general ized framework with linear convergence. SIAM Journal on Optimization , 33(2):1061–1091, 2023

work page 2023
[39]

On the convergence and sam- ple efﬁciency of variance-reduced policy gradient method

Junyu Zhang, Chengzhuo Ni, Csaba Szepesvari, Mengdi W a ng, et al. On the convergence and sam- ple efﬁciency of variance-reduced policy gradient method. Advances in Neural Information Processing Systems, 34:2228–2240, 2021

work page 2021
[40]

On the convergence and sample efﬁciency of variance-reduced policy gradient m ethod

Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, and Mengdi W ang. On the convergence and sample efﬁciency of variance-reduced policy gradient m ethod. Advances in Neural Information Processing Systems, 34:2228–2240, 2021. 13 A Proof of Lemma 2 Fix a tuple (θ,λ,τ ). The following proof assumes that the policy and all relevan t value functions are di...

work page 2021

[1] [1]

On the theory of policy gradient methods: Optimality , approximation, and distribution shi ft

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Maha jan. On the theory of policy gradient methods: Optimality , approximation, and distribution shi ft. The Journal of Machine Learning Research , 22(1):4431–4506, 2021

work page 2021

[2] [2]

Achieving zero constraint violation for constrained reinforcement learn ing via primal-dual approach

Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel , and V aneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learn ing via primal-dual approach. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence , volume 36, pages 3682–3689, 2022

work page 2022

[3] [3]

Achiev ing zero constraint violation for con- strained reinforcement learning via conservative natural policy gradient primal-dual algorithm

Qinbo Bai, Amrit Singh Bedi, and V aneet Aggarwal. Achiev ing zero constraint violation for con- strained reinforcement learning via conservative natural policy gradient primal-dual algorithm. In Proceedings of the AAAI Conference on Artiﬁcial Intelligen ce, volume 37, pages 6737–6744, 2023

work page 2023

[4] [4]

Regret analysis of policy gradient algorithm for inﬁnite horizon average reward markov decision process es

Qinbo Bai, W ashim Uddin Mondal, and V aneet Aggarwal. Regret analysis of policy gradient algorithm for inﬁnite horizon average reward markov decision process es. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 38, pages 10980–10988, 2024

work page 2024

[5] [5]

On the linear convergen ce of policy gradient methods for ﬁnite mdps

Jalaj Bhandari and Daniel Russo. On the linear convergen ce of policy gradient methods for ﬁnite mdps. In International Conference on Artiﬁcial Intelligence and St atistics, pages 2386–2394. PMLR, 2021

work page 2021

[6] [6]

Fast global convergence of natural policy gradient methods with entropy regularization

Shicong Cen, Chen Cheng, Yuxin Chen, Yuting W ei, and Yuejie Chi. Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578, 2022

work page 2022

[7] [7]

Sample complexity o f policy-based methods under off-policy sampling and linear function approximation

Zaiwei Chen and Siva Theja Maguluri. Sample complexity o f policy-based methods under off-policy sampling and linear function approximation. In International Conference on Artiﬁcial Intelligence and Statistics, pages 11195–11214. PMLR, 2022

work page 2022

[8] [8]

Last-iterate convergent policy gradient primal-dual methods for constrained mdps

Dongsheng Ding, Chen-Yu W ei, Kaiqing Zhang, and Alejand ro Ribeiro. Last-iterate convergent policy gradient primal-dual methods for constrained mdps. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[9] [9]

Provably efﬁ- cient safe exploration via primal-dual policy optimizatio n

Dongsheng Ding, Xiaohan W ei, Zhuoran Y ang, Zhaoran W ang, and Mihailo Jovanovic. Provably efﬁ- cient safe exploration via primal-dual policy optimizatio n. In International conference on artiﬁcial intelli- gence and statistics , pages 3304–3312. PMLR, 2021. 11

work page 2021

[10] [10]

Natural policy gradient primal- dual method for constrained markov decision processes

Dongsheng Ding, Kaiqing Zhang, T amer Basar, and Mihailo Jovanovic. Natural policy gradient primal- dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33:8378–8390, 2020

work page 2020

[11] [11]

Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

Y onathan Efroni, Shie Mannor, and Matteo Pirotta. Expl oration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189 , 2020

work page arXiv 2003

[12] [12]

Stochastic policy gradient methods: Improved sample complexity for ﬁsher-non-degenerate poli cies

Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, an d Niao He. Stochastic policy gradient methods: Improved sample complexity for ﬁsher-non-degenerate poli cies. In International Conference on Machine Learning, pages 9827–9869. PMLR, 2023

work page 2023

[13] [13]

Page-pg: A simple and loopless variance-reduced policy gradient meth od with probabilistic gradient estimation

Matilde Gargiani, Andrea Zanelli, Andrea Martinelli, Tyler Summers, and John Lygeros. Page-pg: A simple and loopless variance-reduced policy gradient meth od with probabilistic gradient estimation. In International Conference on Machine Learning , pages 7223–7240. PMLR, 2022

work page 2022

[14] [14]

Algorithm for constrained markov decisio n process with linear convergence

Egor Gladin, Maksim Lavrik-Karmazin, Karina Zainulli na, V arvara Rudenko, Alexander Gasnikov , and Martin T akac. Algorithm for constrained markov decisio n process with linear convergence. In International Conference on Artiﬁcial Intelligence and St atistics, pages 11506–11533. PMLR, 2023

work page 2023

[15] [15]

Nearly minima x optimal reinforcement learning for discounted mdps

Jiafan He, Dongruo Zhou, and Quanquan Gu. Nearly minima x optimal reinforcement learning for discounted mdps. Advances in Neural Information Processing Systems , 34:22288–22300, 2021

work page 2021

[16] [16]

Mo mentum-based policy gradient methods

Feihu Huang, Shangqian Gao, Jian Pei, and Heng Huang. Mo mentum-based policy gradient methods. In International conference on machine learning , pages 4422–4433. PMLR, 2020

work page 2020

[17] [17]

Accelerating stochastic gradient descent for least squares regression

Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Ne trapalli, and Aaron Sidford. Accelerating stochastic gradient descent for least squares regression. In Conference On Learning Theory , pages 545–

work page

[18] [18]

Provably efﬁcient reinforcement learning with linear function approximation

Chi Jin, Zhuoran Y ang, Zhaoran W ang, and Michael I Jordan. Provably efﬁcient reinforcement learning with linear function approximation. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research , pages 2137–2143. PMLR, 09–12 Jul 2020

work page 2020

[19] [19]

Finite-sample analysis of two-time-scale natural actor–critic algorithm

Sajad Khodadadian, Thinh T Doan, Justin Romberg, and Si va Theja Maguluri. Finite-sample analysis of two-time-scale natural actor–critic algorithm. IEEE T ransactions on Automatic Control , 68(6):3273– 3284, 2022

work page 2022

[20] [20]

Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity , and generalized problem classes

Guanghui Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity , and generalized problem classes. Mathematical programming, 198(1):1059–1106, 2023

work page 2023

[21] [21]

Learning policies with zero or bounded constraint violation for constrained mdps

T ao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kuma r, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183–17193, 2021

work page 2021

[22] [22]

Policy optimization for constrained mdps with provable fast global convergence

T ao Liu, Ruida Zhou, Dileep Kalathil, PR Kumar, and Chao Tian. Policy optimization for constrained mdps with provable fast global convergence. arXiv preprint arXiv:2111.00552 , 2021

work page arXiv 2021

[23] [23]

An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods

Y anli Liu, Kaiqing Zhang, T amer Basar, and W otao Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33:7624–7636, 2020

work page 2020

[24] [24]

Stochastic second- order methods improve best-known sample complexity of sgd f or gradient-dominated functions

Saeed Masiha, Saber Salehkaleybar, Niao He, Negar Kiya vash, and Patrick Thiran. Stochastic second- order methods improve best-known sample complexity of sgd f or gradient-dominated functions. Ad- vances in Neural Information Processing Systems , 35:10862–10875, 2022

work page 2022

[25] [25]

Improved samp le complexity analysis of natural pol- icy gradient algorithm with general parameterization for i nﬁnite horizon discounted reward Markov decision processes

W ashim Uddin Mondal and V aneet Aggarwal. Improved samp le complexity analysis of natural pol- icy gradient algorithm with general parameterization for i nﬁnite horizon discounted reward Markov decision processes. In International Conference on Artiﬁcial Intelligence and St atistics, pages 3097–3105. PMLR, 2024

work page 2024

[26] [26]

Sample-efﬁci ent constrained reinforcement learning with general parameterization

W ashim Uddin Mondal and V aneet Aggarwal. Sample-efﬁci ent constrained reinforcement learning with general parameterization. arXiv preprint arXiv:2405.10624 , 2024. 12

work page arXiv 2024

[27] [27]

Last-iterate global convergence of policy gradients for constrained rei nforcement learning

Alessandro Montenegro, Marco Mussi, Matteo Papini, an d Alberto Maria Metelli. Last-iterate global convergence of policy gradients for constrained rei nforcement learning. arXiv preprint arXiv:2407.10775, 2024

work page arXiv 2024

[28] [28]

Hessian aided policy gradient

Zebang Shen, Alejandro Ribeiro, Hamed Hassani, Hui Qia n, and Chao Mi. Hessian aided policy gradient. In International conference on machine learning , pages 5729–5738. PMLR, 2019

work page 2019

[29] [29]

Policy gradient methods for reinforcement learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, an d Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems , 12, 1999

work page 1999

[30] [30]

Near- optimal sample complexity bounds for con- strained mdps

Sharan V aswani, Lin Y ang, and Csaba Szepesv´ ari. Near- optimal sample complexity bounds for con- strained mdps. Advances in Neural Information Processing Systems , 35:3110–3122, 2022

work page 2022

[31] [31]

N eural policy gradient methods: Global optimality and rates of convergence, 2019

Lingxiao W ang, Qi Cai, Zhuoran Y ang, and Zhaoran W ang. N eural policy gradient methods: Global optimality and rates of convergence, 2019

work page 2019

[32] [32]

Triple-q: A model-fr ee algorithm for constrained reinforcement learning with sublinear regret and zero constraint violati on

Honghao W ei, Xin Liu, and Lei Ying. Triple-q: A model-fr ee algorithm for constrained reinforcement learning with sublinear regret and zero constraint violati on. In International Conference on Artiﬁcial Intelligence and Statistics , pages 3274–3307. PMLR, 2022

work page 2022

[33] [33]

An improved conver gence analysis of stochastic variance- reduced policy gradient

Pan Xu, Felicia Gao, and Quanquan Gu. An improved conver gence analysis of stochastic variance- reduced policy gradient. In Uncertainty in Artiﬁcial Intelligence , pages 541–551. PMLR, 2020

work page 2020

[34] [34]

Crpo: A new a pproach for safe reinforcement learning with convergence guarantee

T engyu Xu, Yingbin Liang, and Guanghui Lan. Crpo: A new a pproach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning , pages 11480–11491, 2021

work page 2021

[35] [35]

Sample-optimal parametric q- learning using linearly additive features

Lin Y ang and Mengdi W ang. Sample-optimal parametric q- learning using linearly additive features. In International conference on machine learning , pages 6995–7004. PMLR, 2019

work page 2019

[36] [36]

A dual appro ach to constrained markov decision processes with entropy regularization

Donghao Ying, Yuhao Ding, and Javad Lavaei. A dual appro ach to constrained markov decision processes with entropy regularization. In International Conference on Artiﬁcial Intelligence and St atistics, pages 1887–1909. PMLR, 2022

work page 1909

[37] [37]

Finite-ti me complexity of online primal-dual natu- ral actor-critic algorithm for constrained markov decisio n processes

Sihan Zeng, Thinh T Doan, and Justin Romberg. Finite-ti me complexity of online primal-dual natu- ral actor-critic algorithm for constrained markov decisio n processes. In 2022 IEEE 61st Conference on Decision and Control (CDC) , pages 4028–4033. IEEE, 2022

work page 2022

[38] [38]

Policy mirror descent for regularized reinforcement learning: A general ized framework with linear convergence

W enhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jaso n D Lee, and Yuejie Chi. Policy mirror descent for regularized reinforcement learning: A general ized framework with linear convergence. SIAM Journal on Optimization , 33(2):1061–1091, 2023

work page 2023

[39] [39]

On the convergence and sam- ple efﬁciency of variance-reduced policy gradient method

Junyu Zhang, Chengzhuo Ni, Csaba Szepesvari, Mengdi W a ng, et al. On the convergence and sam- ple efﬁciency of variance-reduced policy gradient method. Advances in Neural Information Processing Systems, 34:2228–2240, 2021

work page 2021

[40] [40]

On the convergence and sample efﬁciency of variance-reduced policy gradient m ethod

Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, and Mengdi W ang. On the convergence and sample efﬁciency of variance-reduced policy gradient m ethod. Advances in Neural Information Processing Systems, 34:2228–2240, 2021. 13 A Proof of Lemma 2 Fix a tuple (θ,λ,τ ). The following proof assumes that the policy and all relevan t value functions are di...

work page 2021