Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints

Haipeng Luo; Liyu Chen; Rahul Jain

arxiv: 2202.00150 · v1 · pith:EC3HYM6Unew · submitted 2022-01-31 · 💻 cs.LG

Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints

Liyu Chen , Rahul Jain , Haipeng Luo This is my paper

classification 💻 cs.LG

keywords algorithmmdpsregretwidetildeconstraintconstraintsviolationaverage-reward

0 comments

read the original abstract

We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints. We start by designing a policy optimization algorithm with carefully designed action-value estimator and bonus term, and show that for ergodic MDPs, our algorithm ensures $\widetilde{O}(\sqrt{T})$ regret and constant constraint violation, where $T$ is the total number of time steps. This strictly improves over the algorithm of (Singh et al., 2020), whose regret and constraint violation are both $\widetilde{O}(T^{2/3})$. Next, we consider the most general class of weakly communicating MDPs. Through a finite-horizon approximation, we develop another algorithm with $\widetilde{O}(T^{2/3})$ regret and constraint violation, which can be further improved to $\widetilde{O}(\sqrt{T})$ via a simple modification, albeit making the algorithm computationally inefficient. As far as we know, these are the first set of provable algorithms for weakly communicating MDPs with cost constraints.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Stationary Robust Mean-Field Games under Model Mismatches
cs.LG 2026-06 unverdicted novelty 6.0

Develops infinite-horizon stationary robust mean-field games incorporating distributional uncertainty, proves equilibrium existence via fixed-point on contractive Bellman operator, gives convergent algorithm, and deri...