Value Mirror Descent for Reinforcement Learning
Pith reviewed 2026-05-10 19:02 UTC · model grok-4.3
The pith
Value mirror descent folds mirror descent steps into value iteration to reach near-optimal sample complexity while keeping policy Bregman divergence bounded.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By embedding a mirror descent step inside each value iteration cycle, the algorithm attains the near-optimal sample complexity Õ(|S||A|(1-γ)^{-3}ε^{-2}) for general convex regularizers under a generative model and simultaneously keeps the Bregman divergence between the generated policies and the optimal policy bounded throughout the run, a property that enables effective online learning after offline training.
What carries the argument
The value mirror descent update, which applies a Bregman-divergence-based mirror descent step to the current value function inside the standard value iteration recursion.
If this is right
- SVMD attains the near-optimal sample complexity Õ(|S||A|(1-γ)^{-3}ε^{-2}) for general convex regularizers.
- The Bregman divergence to the optimal policy stays bounded, enabling direct transition to continual online learning.
- Under a strongly convex regularizer the sample complexity improves to Õ(|S||A|(1-γ)^{-5}ε^{-1}) in the high-accuracy regime.
- The generated policy converges to the optimal policy when the regularizer is strongly convex.
Where Pith is reading between the lines
- The bounded-divergence property may permit hybrid pipelines that finish offline SVMD training and then continue with online policy-gradient updates without policy reinitialization.
- The deterministic linear convergence rate suggests that momentum or Nesterov acceleration could be inserted inside the mirror descent step to obtain faster deterministic convergence.
- Choosing a regularizer whose geometry matches the MDP transition structure could potentially reduce the explicit |S||A| dependence that remains in the stated bounds.
Load-bearing premise
The analysis requires a generative sampling model that can draw independent transitions for any chosen state-action pair, together with costs bounded in [0,1].
What would settle it
Run SVMD on a small tabular MDP with known optimum and observe whether the Bregman divergence between the iterate policy and the optimal policy grows without bound as the number of iterations increases under stochastic sampling.
read the original abstract
Value iteration-type methods have been extensively studied for computing a nearly optimal value function in reinforcement learning (RL). Under a generative sampling model, these methods can achieve sharper sample complexity than policy optimization approaches, particularly in their dependence on the discount factor. In practice, they are often employed for offline training or in simulated environments. In this paper, we consider discounted Markov decision processes with state space S, action space A, discount factor $\gamma\in(0,1)$ and costs in $[0,1]$. We introduce a novel value optimization method, termed value mirror descent (VMD), which integrates mirror descent from convex optimization into the classical value iteration framework. In the deterministic setting with known transition kernels, we show that VMD converges linearly. For the stochastic setting with a generative model, we develop a stochastic variant, SVMD, which incorporates variance reduction commonly used in stochastic value iteration-type methods. For RL problems with general convex regularizers, SVMD attains a near-optimal sample complexity of $\tilde{O}(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$. Moreover, we establish that the Bregman divergence between the generated and optimal policies remains bounded throughout the iterations. This property is absent in existing stochastic value iteration-type methods but is important for enabling effective online (continual) learning following offline training. Under a strongly convex regularizer, SVMD achieves sample complexity of $\tilde{O}(|S||A|(1-\gamma)^{-5}\epsilon^{-1})$, improving performance in the high-accuracy regime. Furthermore, we prove convergence of the generated policy to the optimal policy. Overall, the proposed method, its analysis, and the resulting guarantees, constitute new contributions to the RL and optimization literature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Value Mirror Descent (VMD) integrating mirror descent into value iteration for discounted MDPs with costs in [0,1]. It proves linear convergence for deterministic VMD with known transitions, and for stochastic SVMD with variance reduction under a generative model, it claims near-optimal sample complexity Õ(|S||A|(1-γ)^{-3}ε^{-2}) for general convex regularizers (improved to Õ(|S||A|(1-γ)^{-5}ε^{-1}) for strongly convex), bounded Bregman divergence D_ψ(π_t, π*) for all iterations, and convergence of the generated policy to the optimal policy.
Significance. If the central claims hold, the work offers a new bridge between mirror descent and value iteration that achieves sample complexities competitive with or better than existing stochastic value iteration methods in the discount factor dependence, while adding the bounded Bregman divergence property absent from prior SV I-type algorithms. This property is highlighted as enabling effective online/continual learning after offline training. The deterministic linear convergence and policy convergence results are also positive contributions to the RL-optimization literature.
major comments (2)
- [Abstract and SVMD analysis] Abstract and the SVMD analysis (likely §4-5): The sample complexity Õ(|S||A|(1-γ)^{-3}ε^{-2}) and the claim that Bregman divergence D_ψ(π_t, π*) remains bounded for all t both rest on the generative oracle supplying independent transitions for every (s,a) and on the variance-reduction estimator controlling bias/variance uniformly across iterations. The telescoping argument used to bound the Bregman term (as flagged in the stress-test) does not automatically transfer if either modeling choice fails, and the manuscript must explicitly verify that the VR step-size schedule cancels the mirror-map-dependent noise term without degrading the (1-γ) or ε exponents.
- [Deterministic VMD section] Deterministic VMD section (likely §3): The linear convergence statement is given, but the rate constant must be derived explicitly in terms of the regularizer's strong-convexity or smoothness parameters to confirm it is indeed linear (rather than sublinear) and to allow direct comparison with standard value iteration rates.
minor comments (2)
- [Abstract] The abstract uses both ε and (1-γ) without defining the tilde-O notation or the precise meaning of 'near-optimal'; add a short clarification or reference to standard RL sample-complexity conventions.
- [Preliminaries] Bregman divergence D_ψ should be defined at first use (early in §2) rather than assumed known, to improve accessibility for RL readers.
Simulated Author's Rebuttal
We thank the referee for the thorough review and insightful comments, which help us clarify and strengthen the presentation of our results. We address each major comment below and outline the revisions we will make to the manuscript.
read point-by-point responses
-
Referee: [Abstract and SVMD analysis] Abstract and the SVMD analysis (likely §4-5): The sample complexity Õ(|S||A|(1-γ)^{-3}ε^{-2}) and the claim that Bregman divergence D_ψ(π_t, π*) remains bounded for all t both rest on the generative oracle supplying independent transitions for every (s,a) and on the variance-reduction estimator controlling bias/variance uniformly across iterations. The telescoping argument used to bound the Bregman term (as flagged in the stress-test) does not automatically transfer if either modeling choice fails, and the manuscript must explicitly verify that the VR step-size schedule cancels the mirror-map-dependent noise term without degrading the (1-γ) or ε exponents.
Authors: We agree that the sample complexity and bounded Bregman divergence claims rely on the generative model and the specific variance-reduction estimator. In the proof of Theorem 5.1 (SVMD analysis), the step-size schedule η_t = Θ(1/(t + 1)) combined with the VR estimator ensures that the mirror-map-dependent noise terms telescope and cancel uniformly across iterations, preserving the (1-γ)^{-3} and ε^{-2} exponents. The boundedness of D_ψ(π_t, π*) follows directly from this cancellation without additional assumptions beyond the generative oracle. To address the concern, we will add an explicit remark and a short lemma in the revised §5 verifying the cancellation step and confirming that the exponents are unaffected. revision: yes
-
Referee: [Deterministic VMD section] Deterministic VMD section (likely §3): The linear convergence statement is given, but the rate constant must be derived explicitly in terms of the regularizer's strong-convexity or smoothness parameters to confirm it is indeed linear (rather than sublinear) and to allow direct comparison with standard value iteration rates.
Authors: We thank the referee for this observation. In Section 3, the linear convergence of deterministic VMD is established in Theorem 3.1 with rate depending on the strong-convexity parameter μ of the regularizer ψ and the discount factor γ. The contraction factor is explicitly (1 - μ(1-γ)/L) where L is the smoothness constant of the mirror map; this is strictly less than 1 for μ > 0, confirming linear (not sublinear) convergence. We will expand the statement of Theorem 3.1 and add a short derivation of the rate constant in terms of μ and L in the revised manuscript to enable direct comparison with classical value iteration. revision: yes
Circularity Check
No circularity: claims rest on standard convex analysis and generative-model assumptions without self-referential reduction
full rationale
The abstract and provided excerpts present VMD/SVMD as integrating mirror descent into value iteration, with sample-complexity bounds derived under a generative sampling model and variance reduction. No equations are shown that define a quantity in terms of itself (e.g., no fitted parameter renamed as prediction, no self-citation chain invoked as uniqueness theorem, no ansatz smuggled via prior work by the same authors). The Bregman-divergence boundedness claim is presented as a new property proved from the algorithm's update rule and the generative oracle, not as a tautology. The reader's assessment of score 2 is consistent with the absence of load-bearing self-referential steps in the visible derivation outline.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Discounted MDP with finite S and A, costs in [0,1], generative sampling model available
Reference graph
Works this paper leans on
-
[1]
Model-based reinforcement learning with a generative model is minimax optimal
Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. InConference on Learning Theory, pages 67–83. PMLR, 2020
work page 2020
-
[2]
A markovian decision process.Journal of mathematics and mechanics, 6(5):679–684, 1957
Richard Bellman. A markovian decision process.Journal of mathematics and mechanics, 6(5):679–684, 1957
work page 1957
-
[3]
Dimitri P Bertsekas. Approximate policy iteration: A survey and some new methods.Journal of Control Theory and Applications, 9(3):310–335, 2011
work page 2011
-
[4]
Yichen Chen and Mengdi Wang. Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516, 2016
-
[5]
A theory of regularized markov decision processes
Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized markov decision processes. InInterna- tional conference on machine learning, pages 2160–2169. PMLR, 2019
work page 2019
-
[6]
Mohammad Gheshlaghi Azar, R´ emi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model.Machine learning, 91(3):325–349, 2013
work page 2013
-
[7]
Reinforcement learning with deep energy-based policies
Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. InInternational conference on machine learning, pages 1352–1361. PMLR, 2017
work page 2017
-
[8]
Wassily Hoeffding. Probability inequalities for sums of bounded random variables.Journal of the American statistical association, 58(301):13–30, 1963
work page 1963
-
[9]
Ronald A Howard.Dynamic programming and markov processes.John Wiley, 1960
work page 1960
-
[10]
Caleb Ju and Guanghui Lan. Policy optimization over general state and action spaces.arXiv preprint arXiv:2211.16715, 2022
-
[11]
Caleb Ju and Guanghui Lan. Strongly-polynomial time and validation analysis of policy gradient methods.arXiv preprint arXiv:2409.19437, 2024
-
[12]
Caleb Ju and Guanghui Lan. Auto-exploration for online reinforcement learning.arXiv preprint arXiv:2512.06244, 2025
-
[13]
Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and optimal methods for stochastic variational inequalities, i: Operator extrapolation.SIAM Journal on Optimization, 32(3):2041–2073, 2022
work page 2041
-
[14]
Least-squares policy iteration.Journal of machine learning research, 4(Dec):1107–1149, 2003
Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration.Journal of machine learning research, 4(Dec):1107–1149, 2003
work page 2003
-
[15]
Guanghui Lan.First-order and stochastic optimization methods for machine learning, volume 1. Springer, 2020
work page 2020
-
[16]
Guanghui Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes.Mathematical programming, 198(1):1059–1106, 2023
work page 2023
-
[17]
Tianjiao Li, Guanghui Lan, and Ashwin Pananjady. Accelerated and instance-optimal policy evaluation with linear function approximation.SIAM Journal on Mathematics of Data Science, 5(1):174–200, 2023
work page 2023
-
[18]
Tianjiao Li, Feiyang Wu, and Guanghui Lan. Stochastic first-order methods for average-reward markov decision pro- cesses.Mathematics of Operations Research, 50(4):3125–3160, 2025
work page 2025
-
[19]
Yan Li and Guanghui Lan. Policy mirror descent inherently explores action space.SIAM Journal on Optimization, 35(1):116–156, 2025
work page 2025
-
[20]
First-Order Policy Optimization for Robust Markov Decision Process
Yan Li, Guanghui Lan, and Tuo Zhao. First-order policy optimization for robust markov decision process.arXiv preprint arXiv:2209.10579, 2022
-
[21]
Pascal Massart.Concentration inequalities and model selection: Ecole d’Et´ e de Probabilit´ es de Saint-Flour XXXIII-
-
[22]
Influence and variance of a markov chain: Application to adaptive discretization in optimal control
R´ emi Munos and Andrew Moore. Influence and variance of a markov chain: Application to adaptive discretization in optimal control. InProceedings of the 38th IEEE Conference on Decision and Control (Cat. No. 99CH36304), volume 2, pages 1464–1469. IEEE, 1999
work page 1999
-
[23]
Ashwin Pananjady and Martin J Wainwright. Instance-dependentl ∞-bounds for policy evaluation in tabular reinforce- ment learning.IEEE Transactions on Information Theory, 67(1):566–585, 2020
work page 2020
-
[24]
Martin L Puterman.Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014
work page 2014
-
[25]
Trust region policy optimization
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015
work page 2015
-
[26]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[27]
Stochastic games.Proceedings of the national academy of sciences, 39(10):1095–1100, 1953
Lloyd S Shapley. Stochastic games.Proceedings of the national academy of sciences, 39(10):1095–1100, 1953
work page 1953
-
[28]
Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model.Advances in Neural Information Processing Systems, 31, 2018
work page 2018
-
[29]
Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving markov decision processes.Naval Research Logistics (NRL), 70(5):423–442, 2023
work page 2023
-
[30]
Matthew J Sobel. The variance of discounted markov decision processes.Journal of Applied Probability, 19(4):794–802, 1982
work page 1982
-
[31]
Primal-dualπlearning: Sample complexity and sublinear run time for ergodic markov decision problems
Mengdi Wang. Primal-dualπlearning: Sample complexity and sublinear run time for ergodic markov decision problems. arXiv preprint arXiv:1710.06100, 2017
-
[32]
Q-learning.Machine learning, 8(3):279–292, 1992
Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8(3):279–292, 1992
work page 1992
-
[33]
Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D Lee, and Yuejie Chi. Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence.SIAM Journal on Optimization, 33(2):1061–1091, 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.