A Two-Timescale Primal-Dual Framework for Reinforcement Learning via Online Dual Variable Guidance
Pith reviewed 2026-05-22 16:16 UTC · model grok-4.3
The pith
PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP from a single trajectory of data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP. The convergence analysis relies on tools from stochastic approximation theory and holds under weaker assumptions than those required by existing primal-dual RL approaches, notably removing the need for a simulator or a fixed behavioral policy. Under a strengthened ergodicity assumption on the underlying Markov chain, the method achieves a last-iterate finite-time guarantee with Õ(k^{-2/3}) mean-square convergence.
What carries the argument
The two-timescale decomposition of the primal-dual optimization problem where the dual variable provides online guidance to the policy in the projected gradient descent-ascent scheme.
If this is right
- The algorithm learns from a single trajectory of correlated samples without a separate simulator.
- Convergence holds with weaker assumptions than prior primal-dual methods in RL.
- It achieves the best-known convergence rate for two-timescale stochastic approximation under Markovian sampling and biased gradients.
- The method integrates experience replay with asynchronous updates for practical RL.
Where Pith is reading between the lines
- This framework could be applied to other stochastic optimization problems involving nested objectives with biased estimates.
- Removing the fixed behavioral policy requirement may improve performance in environments with changing dynamics.
- The approach suggests new ways to balance exploration and exploitation in off-policy learning by using the dual variable dynamically.
Load-bearing premise
The proof depends on standard stochastic approximation results applying to this asynchronous two-timescale system with biased gradient estimates obtained from a single trajectory.
What would settle it
Observe the behavior of the algorithm on a small, solvable MDP and check if the value function and policy estimates converge to the known optimum as iterations increase when sampling from a single non-ergodic trajectory.
read the original abstract
We study reinforcement learning by combining recent advances in regularized linear programming formulations with the classical theory of stochastic approximation. Motivated by the challenge of designing algorithms that leverage off-policy data while maintaining on-policy exploration, we propose PGDA-RL, a novel primal-dual Projected Gradient Descent-Ascent algorithm for solving regularized Markov Decision Processes (MDPs). PGDA-RL integrates experience replay-based gradient estimation with a two-timescale decomposition of the underlying nested optimization problem. The algorithm operates asynchronously, interacts with the environment through a single trajectory of correlated data, and updates its policy online in response to the dual variable associated with the occupancy measure of the underlying MDP. We prove that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP. Our convergence analysis relies on tools from stochastic approximation theory and holds under weaker assumptions than those required by existing primal-dual RL approaches, notably removing the need for a simulator or a fixed behavioral policy. Under a strengthened ergodicity assumption on the underlying Markov chain, we establish a last-iterate finite-time guarantee with $\tilde{O} (k^{-2/3})$ mean-square convergence, aligning with the best-known rates for two-timescale stochastic approximation methods under Markovian sampling and biased gradient estimates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PGDA-RL, a two-timescale primal-dual Projected Gradient Descent-Ascent algorithm for regularized MDPs in reinforcement learning. It integrates experience replay-based gradient estimation with asynchronous updates from a single trajectory and proves almost sure convergence to the optimal value function and policy using stochastic approximation theory under weaker assumptions than existing approaches, without requiring a simulator or fixed behavioral policy. It also provides a finite-time last-iterate mean-square convergence rate of Õ(k^{-2/3}) under a strengthened ergodicity assumption.
Significance. Should the convergence analysis hold, the work offers a significant advancement in primal-dual RL methods by enabling online dual variable guidance for policy updates using off-policy data from a single trajectory. This removes restrictive assumptions like fixed behavioral policies, potentially leading to more practical algorithms. The alignment with optimal rates for two-timescale SA methods under Markovian sampling is a positive aspect, and the use of regularized LP formulations is well-motivated.
major comments (2)
- [Convergence Analysis] The almost sure convergence claim depends on the applicability of standard stochastic approximation results (such as Borkar-type theorems for two-timescale systems) to the asynchronous updates with biased gradient estimates from a single correlated trajectory. The manuscript needs to explicitly show that the bias vanishes at a rate compatible with the step-size schedules and that the Markov chain satisfies the required ergodicity or mixing conditions without a fixed behavioral policy. This is load-bearing for the central claim of weaker assumptions.
- [Finite-time Guarantee] The strengthened ergodicity assumption introduced for the Õ(k^{-2/3}) mean-square convergence rate should be discussed in relation to the assumptions in existing primal-dual RL literature to confirm that the overall set of assumptions remains weaker. Without this comparison, the improvement in the finite-time result is difficult to evaluate.
minor comments (2)
- [Abstract] Clarify the exact meaning of the last-iterate finite-time guarantee and specify the dependence on the iteration index k.
- [Notation] Ensure consistent use of symbols for the dual variable and occupancy measure throughout the paper.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of our work's potential impact. We address each major comment below and will revise the manuscript accordingly to improve clarity.
read point-by-point responses
-
Referee: [Convergence Analysis] The almost sure convergence claim depends on the applicability of standard stochastic approximation results (such as Borkar-type theorems for two-timescale systems) to the asynchronous updates with biased gradient estimates from a single correlated trajectory. The manuscript needs to explicitly show that the bias vanishes at a rate compatible with the step-size schedules and that the Markov chain satisfies the required ergodicity or mixing conditions without a fixed behavioral policy. This is load-bearing for the central claim of weaker assumptions.
Authors: We appreciate the referee's emphasis on this foundational aspect. The proof of almost-sure convergence (Theorem 1 in Section 4) applies the two-timescale stochastic approximation framework of Borkar (2008) after establishing the requisite conditions in Lemmas 2--4. The bias from experience-replay gradient estimates is controlled by the replay buffer size and decays as O(1/k) under the chosen step-size schedules (α_k ∼ k^{-2/3}, β_k ∼ k^{-1/3}), which satisfies the summability requirements for a.s. convergence. The Markov chain is driven by the slowly varying policy; geometric ergodicity follows from the regularization ensuring a unique stationary distribution with mixing rate independent of any fixed behavior policy (see Assumption 3 and the discussion in Section 3.2). To address the request for explicit verification, we will insert a new subsection 4.1 that tabulates each Borkar condition and confirms its satisfaction, including the precise bias-decay rate. revision: partial
-
Referee: [Finite-time Guarantee] The strengthened ergodicity assumption introduced for the Õ(k^{-2/3}) mean-square convergence rate should be discussed in relation to the assumptions in existing primal-dual RL literature to confirm that the overall set of assumptions remains weaker. Without this comparison, the improvement in the finite-time result is difficult to evaluate.
Authors: We agree that an explicit comparison clarifies the contribution. The strengthened ergodicity condition (Assumption 4) is invoked solely for the finite-time last-iterate bound in Theorem 2 and is standard in the Markovian stochastic-approximation literature for attaining the optimal Õ(k^{-2/3}) rate. The almost-sure convergence result holds under the strictly weaker set of assumptions that dispense with both a simulator and a fixed behavioral policy. In the revision we will add a dedicated paragraph in Section 5 and a short comparison table (new Table 1) that contrasts Assumption 4 with the ergodicity conditions appearing in recent primal-dual RL works (e.g., those requiring a fixed behavior policy or i.i.d. sampling), thereby confirming that the core modeling assumptions remain weaker. revision: yes
Circularity Check
Convergence analysis invokes external stochastic approximation theorems without internal reduction
full rationale
The paper's almost-sure convergence result for PGDA-RL is derived by applying classical two-timescale stochastic approximation theorems (e.g., Borkar-type results) to the asynchronous primal-dual iterates with experience-replay gradients from a single trajectory. This structure does not reduce the claimed optimality or convergence to any quantity defined by the algorithm's own fitted parameters, self-referential definitions, or renamed empirical patterns. The finite-time bound under a strengthened ergodicity assumption similarly aligns with known rates for Markovian sampling and biased estimates, remaining independent of internal fits. No load-bearing self-citations or ansatzes smuggled via prior author work appear in the provided derivation chain; the analysis is self-contained against external mathematical benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Strengthened ergodicity assumption on the underlying Markov chain
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We prove that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP. Our convergence analysis relies on tools from stochastic approximation theory...
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
two-timescale decomposition... last-iterate finite-time guarantee with Õ(k^{-2/3})
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.