Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning

Mengdi Wang; Yichen Chen

arxiv: 1612.02516 · v1 · pith:OQXYG5RLnew · submitted 2016-12-08 · 📊 stat.ML · cs.AI · math.OC

keywords methods · policy · complexity · optimal

We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods which exploit the inherent minimax duality of Bellman equations. The SPD methods update a few coordinates of the value and policy estimates each time a new state transition is observed. These methods require little storage and have low computational complexity per iteration. The SPD methods find an absolute-$\epsilon$-optimal policy, with high probability, using $\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2\sigma^2 }{(1-\gamma)^6\epsilon^2} \right)$ iterations/samples for the infinite-horizon discounted-reward MDP and $\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2H^6\sigma^2 }{\epsilon^2} \right)$ iterations/samples for the finite-horizon MDP.
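To get a feel for how the two bounds in the abstract scale, the following is a minimal sketch that evaluates only their leading terms. The function names are hypothetical, the constants hidden by the $\mathcal{O}(\cdot)$ notation are dropped, and $\sigma$ is passed through as it appears in the abstract (which does not define it), so the returned numbers indicate growth rates and are not actual sample counts.

```python
def spd_bound_discounted(n_states, n_actions, sigma, gamma, eps):
    """Leading term |S|^4 |A|^2 sigma^2 / ((1 - gamma)^6 eps^2).

    Hypothetical helper: constants hidden by O(.) are omitted.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    return n_states**4 * n_actions**2 * sigma**2 / ((1.0 - gamma) ** 6 * eps**2)


def spd_bound_finite_horizon(n_states, n_actions, sigma, horizon, eps):
    """Leading term |S|^4 |A|^2 H^6 sigma^2 / eps^2.

    Hypothetical helper: constants hidden by O(.) are omitted.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    return n_states**4 * n_actions**2 * horizon**6 * sigma**2 / eps**2


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    base = spd_bound_discounted(n_states=10, n_actions=4, sigma=1.0, gamma=0.9, eps=0.1)
    tighter = spd_bound_discounted(n_states=10, n_actions=4, sigma=1.0, gamma=0.9, eps=0.05)
    print(f"leading term at eps=0.1:  {base:.3e}")
    print(f"leading term at eps=0.05: {tighter:.3e}")
    print(f"ratio: {tighter / base:.1f}")
```

Under these formulas, halving $\epsilon$ multiplies the leading term by four, and substituting $H = 1/(1-\gamma)$ into the finite-horizon expression gives the same value as the discounted one.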

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Value Mirror Descent for Reinforcement Learning
math.OC 2026-04 unverdicted novelty 5.0

Value mirror descent integrates mirror descent into value iteration for discounted MDPs, delivering near-optimal sample complexity of order |S||A|(1-γ)^{-3}ε^{-2} for general convex regularizers and bounded Bregman di...

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
cs.LG 2020-05 unverdicted novelty 2.0

Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.