A Two-Timescale Primal-Dual Framework for Reinforcement Learning via Online Dual Variable Guidance

Axel Friedrich Wolter; Tobias Sutter

arxiv: 2505.04494 · v3 · submitted 2025-05-07 · 🧮 math.OC · cs.LG

A Two-Timescale Primal-Dual Framework for Reinforcement Learning via Online Dual Variable Guidance

Axel Friedrich Wolter , Tobias Sutter This is my paper

Pith reviewed 2026-05-22 16:16 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords reinforcement learningprimal-dual optimizationstochastic approximationregularized MDPstwo-timescale methodsalmost sure convergenceonline dual guidance

0 comments

The pith

PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP from a single trajectory of data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops PGDA-RL, a primal-dual algorithm for regularized MDPs that combines projected gradient descent-ascent with a two-timescale structure. The dual variable for the occupancy measure guides the policy updates online while gradients are estimated asynchronously from experience replay on one trajectory of data. This design removes the need for a simulator or fixed behavioral policy, allowing the algorithm to maintain on-policy exploration with off-policy data. The authors use stochastic approximation theory to prove almost sure convergence to the optimal regularized value and policy, with a finite-time last-iterate rate of order k to the power -2/3 under stronger ergodicity assumptions on the Markov chain.

Core claim

The central claim is that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP. The convergence analysis relies on tools from stochastic approximation theory and holds under weaker assumptions than those required by existing primal-dual RL approaches, notably removing the need for a simulator or a fixed behavioral policy. Under a strengthened ergodicity assumption on the underlying Markov chain, the method achieves a last-iterate finite-time guarantee with Õ(k^{-2/3}) mean-square convergence.

What carries the argument

The two-timescale decomposition of the primal-dual optimization problem where the dual variable provides online guidance to the policy in the projected gradient descent-ascent scheme.

If this is right

The algorithm learns from a single trajectory of correlated samples without a separate simulator.
Convergence holds with weaker assumptions than prior primal-dual methods in RL.
It achieves the best-known convergence rate for two-timescale stochastic approximation under Markovian sampling and biased gradients.
The method integrates experience replay with asynchronous updates for practical RL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This framework could be applied to other stochastic optimization problems involving nested objectives with biased estimates.
Removing the fixed behavioral policy requirement may improve performance in environments with changing dynamics.
The approach suggests new ways to balance exploration and exploitation in off-policy learning by using the dual variable dynamically.

Load-bearing premise

The proof depends on standard stochastic approximation results applying to this asynchronous two-timescale system with biased gradient estimates obtained from a single trajectory.

What would settle it

Observe the behavior of the algorithm on a small, solvable MDP and check if the value function and policy estimates converge to the known optimum as iterations increase when sampling from a single non-ergodic trajectory.

read the original abstract

We study reinforcement learning by combining recent advances in regularized linear programming formulations with the classical theory of stochastic approximation. Motivated by the challenge of designing algorithms that leverage off-policy data while maintaining on-policy exploration, we propose PGDA-RL, a novel primal-dual Projected Gradient Descent-Ascent algorithm for solving regularized Markov Decision Processes (MDPs). PGDA-RL integrates experience replay-based gradient estimation with a two-timescale decomposition of the underlying nested optimization problem. The algorithm operates asynchronously, interacts with the environment through a single trajectory of correlated data, and updates its policy online in response to the dual variable associated with the occupancy measure of the underlying MDP. We prove that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP. Our convergence analysis relies on tools from stochastic approximation theory and holds under weaker assumptions than those required by existing primal-dual RL approaches, notably removing the need for a simulator or a fixed behavioral policy. Under a strengthened ergodicity assumption on the underlying Markov chain, we establish a last-iterate finite-time guarantee with $\tilde{O} (k^{-2/3})$ mean-square convergence, aligning with the best-known rates for two-timescale stochastic approximation methods under Markovian sampling and biased gradient estimates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper puts forward a single-trajectory primal-dual RL algorithm with online dual guidance and claims almost-sure convergence under relaxed assumptions, but the analysis rests on standard stochastic approximation results whose fit to biased Markovian updates needs closer checking.

read the letter

The core contribution is PGDA-RL, which runs a two-timescale projected gradient descent-ascent on the regularized MDP linear program while estimating gradients from experience replay on one correlated trajectory and letting the dual variable guide the policy update online. This removes the usual simulator or fixed behavioral policy, which is a practical step forward for off-policy data use without breaking exploration. The algorithm description is clear and the high-level reduction to nested optimization plus stochastic approximation is straightforward. They also give a last-iterate finite-time rate of order k to the minus two-thirds under a strengthened ergodicity condition, which lines up with existing two-timescale bounds. Credit is due for making the dual variable drive the policy directly rather than treating it as a static regularizer. The soft spot is the convergence argument. The almost-sure claim invokes classical Borkar-style theorems for asynchronous two-timescale systems, yet the gradient estimates carry bias from the single Markov chain and the updates are not synchronized in the usual way. It is not automatic that the bias decays fast enough relative to the step-size schedules or that uniform ergodicity holds without extra mixing assumptions on the chain. The paper is honest about needing a strengthened ergodicity condition for the finite-time result, but that undercuts how much weaker the overall assumptions really are compared with prior primal-dual RL work. The citation pattern is appropriate and points to the relevant stochastic approximation and RL literature. This is aimed at researchers who work on optimization formulations of RL or who want to relax simulator requirements in primal-dual methods. A reader who already knows two-timescale SA will see the algorithmic novelty quickly. I would send it to peer review; the idea is concrete enough and the gaps are fixable with more explicit verification of the SA conditions rather than a fundamental flaw.

Referee Report

2 major / 2 minor

Summary. The paper proposes PGDA-RL, a two-timescale primal-dual Projected Gradient Descent-Ascent algorithm for regularized MDPs in reinforcement learning. It integrates experience replay-based gradient estimation with asynchronous updates from a single trajectory and proves almost sure convergence to the optimal value function and policy using stochastic approximation theory under weaker assumptions than existing approaches, without requiring a simulator or fixed behavioral policy. It also provides a finite-time last-iterate mean-square convergence rate of Õ(k^{-2/3}) under a strengthened ergodicity assumption.

Significance. Should the convergence analysis hold, the work offers a significant advancement in primal-dual RL methods by enabling online dual variable guidance for policy updates using off-policy data from a single trajectory. This removes restrictive assumptions like fixed behavioral policies, potentially leading to more practical algorithms. The alignment with optimal rates for two-timescale SA methods under Markovian sampling is a positive aspect, and the use of regularized LP formulations is well-motivated.

major comments (2)

[Convergence Analysis] The almost sure convergence claim depends on the applicability of standard stochastic approximation results (such as Borkar-type theorems for two-timescale systems) to the asynchronous updates with biased gradient estimates from a single correlated trajectory. The manuscript needs to explicitly show that the bias vanishes at a rate compatible with the step-size schedules and that the Markov chain satisfies the required ergodicity or mixing conditions without a fixed behavioral policy. This is load-bearing for the central claim of weaker assumptions.
[Finite-time Guarantee] The strengthened ergodicity assumption introduced for the Õ(k^{-2/3}) mean-square convergence rate should be discussed in relation to the assumptions in existing primal-dual RL literature to confirm that the overall set of assumptions remains weaker. Without this comparison, the improvement in the finite-time result is difficult to evaluate.

minor comments (2)

[Abstract] Clarify the exact meaning of the last-iterate finite-time guarantee and specify the dependence on the iteration index k.
[Notation] Ensure consistent use of symbols for the dual variable and occupancy measure throughout the paper.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of our work's potential impact. We address each major comment below and will revise the manuscript accordingly to improve clarity.

read point-by-point responses

Referee: [Convergence Analysis] The almost sure convergence claim depends on the applicability of standard stochastic approximation results (such as Borkar-type theorems for two-timescale systems) to the asynchronous updates with biased gradient estimates from a single correlated trajectory. The manuscript needs to explicitly show that the bias vanishes at a rate compatible with the step-size schedules and that the Markov chain satisfies the required ergodicity or mixing conditions without a fixed behavioral policy. This is load-bearing for the central claim of weaker assumptions.

Authors: We appreciate the referee's emphasis on this foundational aspect. The proof of almost-sure convergence (Theorem 1 in Section 4) applies the two-timescale stochastic approximation framework of Borkar (2008) after establishing the requisite conditions in Lemmas 2--4. The bias from experience-replay gradient estimates is controlled by the replay buffer size and decays as O(1/k) under the chosen step-size schedules (α_k ∼ k^{-2/3}, β_k ∼ k^{-1/3}), which satisfies the summability requirements for a.s. convergence. The Markov chain is driven by the slowly varying policy; geometric ergodicity follows from the regularization ensuring a unique stationary distribution with mixing rate independent of any fixed behavior policy (see Assumption 3 and the discussion in Section 3.2). To address the request for explicit verification, we will insert a new subsection 4.1 that tabulates each Borkar condition and confirms its satisfaction, including the precise bias-decay rate. revision: partial
Referee: [Finite-time Guarantee] The strengthened ergodicity assumption introduced for the Õ(k^{-2/3}) mean-square convergence rate should be discussed in relation to the assumptions in existing primal-dual RL literature to confirm that the overall set of assumptions remains weaker. Without this comparison, the improvement in the finite-time result is difficult to evaluate.

Authors: We agree that an explicit comparison clarifies the contribution. The strengthened ergodicity condition (Assumption 4) is invoked solely for the finite-time last-iterate bound in Theorem 2 and is standard in the Markovian stochastic-approximation literature for attaining the optimal Õ(k^{-2/3}) rate. The almost-sure convergence result holds under the strictly weaker set of assumptions that dispense with both a simulator and a fixed behavioral policy. In the revision we will add a dedicated paragraph in Section 5 and a short comparison table (new Table 1) that contrasts Assumption 4 with the ergodicity conditions appearing in recent primal-dual RL works (e.g., those requiring a fixed behavior policy or i.i.d. sampling), thereby confirming that the core modeling assumptions remain weaker. revision: yes

Circularity Check

0 steps flagged

Convergence analysis invokes external stochastic approximation theorems without internal reduction

full rationale

The paper's almost-sure convergence result for PGDA-RL is derived by applying classical two-timescale stochastic approximation theorems (e.g., Borkar-type results) to the asynchronous primal-dual iterates with experience-replay gradients from a single trajectory. This structure does not reduce the claimed optimality or convergence to any quantity defined by the algorithm's own fitted parameters, self-referential definitions, or renamed empirical patterns. The finite-time bound under a strengthened ergodicity assumption similarly aligns with known rates for Markovian sampling and biased estimates, remaining independent of internal fits. No load-bearing self-citations or ansatzes smuggled via prior author work appear in the provided derivation chain; the analysis is self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard stochastic approximation theory for two-timescale systems and MDP regularity conditions; the main domain assumption is the strengthened ergodicity needed for the explicit rate.

axioms (1)

domain assumption Strengthened ergodicity assumption on the underlying Markov chain
Invoked to obtain the last-iterate finite-time guarantee of Õ(k^{-2/3}) mean-square convergence under Markovian sampling.

pith-pipeline@v0.9.0 · 5760 in / 1321 out tokens · 56396 ms · 2026-05-22T16:16:25.192847+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP. Our convergence analysis relies on tools from stochastic approximation theory...
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

two-timescale decomposition... last-iterate finite-time guarantee with Õ(k^{-2/3})

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.