First-Order Policy Optimization for Robust Markov Decision Process
5 Pith papers cite this work.
citing papers explorer
- Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework
Natural policy gradient is a special case of doubly smoothed policy iteration that achieves distribution-free global geometric convergence to an ε-optimal policy in O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) iterations.
- Revisiting Subgradient Dominance in Robust MDPs: Counterexamples, Hardness, and Sufficient Conditions
RMDPs lack subgradient dominance in general and admit suboptimal local minima; finding ε-optimal policies is NP-hard for finite transition uncertainty sets, but the dominance property holds when worst-case kernels or action-values are unique per policy.
- Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form
Presents the first algorithm to identify an ε-optimal policy in robust constrained MDPs via epigraph form and bisection search with Õ(ε^{-4}) robust policy evaluations.
- Value Mirror Descent for Reinforcement Learning
Value mirror descent integrates mirror descent into value iteration for discounted MDPs, delivering near-optimal sample complexity of order |S||A|(1-γ)^{-3}ε^{-2} for general convex regularizers and bounded Bregman divergence between generated and optimal policies.
- Sample Complexity for Markov Decision Processes and Stochastic Optimal Control with Static Risk Measures
State augmentation allows dynamic programming and sample complexity bounds for MDPs and optimal control under static risk measures including CVaR.