pith. sign in

arxiv: 2505.04494 · v3 · submitted 2025-05-07 · 🧮 math.OC · cs.LG

A Two-Timescale Primal-Dual Framework for Reinforcement Learning via Online Dual Variable Guidance

Pith reviewed 2026-05-22 16:16 UTC · model grok-4.3

classification 🧮 math.OC cs.LG
keywords reinforcement learningprimal-dual optimizationstochastic approximationregularized MDPstwo-timescale methodsalmost sure convergenceonline dual guidance
0
0 comments X

The pith

PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP from a single trajectory of data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops PGDA-RL, a primal-dual algorithm for regularized MDPs that combines projected gradient descent-ascent with a two-timescale structure. The dual variable for the occupancy measure guides the policy updates online while gradients are estimated asynchronously from experience replay on one trajectory of data. This design removes the need for a simulator or fixed behavioral policy, allowing the algorithm to maintain on-policy exploration with off-policy data. The authors use stochastic approximation theory to prove almost sure convergence to the optimal regularized value and policy, with a finite-time last-iterate rate of order k to the power -2/3 under stronger ergodicity assumptions on the Markov chain.

Core claim

The central claim is that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP. The convergence analysis relies on tools from stochastic approximation theory and holds under weaker assumptions than those required by existing primal-dual RL approaches, notably removing the need for a simulator or a fixed behavioral policy. Under a strengthened ergodicity assumption on the underlying Markov chain, the method achieves a last-iterate finite-time guarantee with Õ(k^{-2/3}) mean-square convergence.

What carries the argument

The two-timescale decomposition of the primal-dual optimization problem where the dual variable provides online guidance to the policy in the projected gradient descent-ascent scheme.

If this is right

  • The algorithm learns from a single trajectory of correlated samples without a separate simulator.
  • Convergence holds with weaker assumptions than prior primal-dual methods in RL.
  • It achieves the best-known convergence rate for two-timescale stochastic approximation under Markovian sampling and biased gradients.
  • The method integrates experience replay with asynchronous updates for practical RL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework could be applied to other stochastic optimization problems involving nested objectives with biased estimates.
  • Removing the fixed behavioral policy requirement may improve performance in environments with changing dynamics.
  • The approach suggests new ways to balance exploration and exploitation in off-policy learning by using the dual variable dynamically.

Load-bearing premise

The proof depends on standard stochastic approximation results applying to this asynchronous two-timescale system with biased gradient estimates obtained from a single trajectory.

What would settle it

Observe the behavior of the algorithm on a small, solvable MDP and check if the value function and policy estimates converge to the known optimum as iterations increase when sampling from a single non-ergodic trajectory.

read the original abstract

We study reinforcement learning by combining recent advances in regularized linear programming formulations with the classical theory of stochastic approximation. Motivated by the challenge of designing algorithms that leverage off-policy data while maintaining on-policy exploration, we propose PGDA-RL, a novel primal-dual Projected Gradient Descent-Ascent algorithm for solving regularized Markov Decision Processes (MDPs). PGDA-RL integrates experience replay-based gradient estimation with a two-timescale decomposition of the underlying nested optimization problem. The algorithm operates asynchronously, interacts with the environment through a single trajectory of correlated data, and updates its policy online in response to the dual variable associated with the occupancy measure of the underlying MDP. We prove that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP. Our convergence analysis relies on tools from stochastic approximation theory and holds under weaker assumptions than those required by existing primal-dual RL approaches, notably removing the need for a simulator or a fixed behavioral policy. Under a strengthened ergodicity assumption on the underlying Markov chain, we establish a last-iterate finite-time guarantee with $\tilde{O} (k^{-2/3})$ mean-square convergence, aligning with the best-known rates for two-timescale stochastic approximation methods under Markovian sampling and biased gradient estimates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PGDA-RL, a two-timescale primal-dual Projected Gradient Descent-Ascent algorithm for regularized MDPs in reinforcement learning. It integrates experience replay-based gradient estimation with asynchronous updates from a single trajectory and proves almost sure convergence to the optimal value function and policy using stochastic approximation theory under weaker assumptions than existing approaches, without requiring a simulator or fixed behavioral policy. It also provides a finite-time last-iterate mean-square convergence rate of Õ(k^{-2/3}) under a strengthened ergodicity assumption.

Significance. Should the convergence analysis hold, the work offers a significant advancement in primal-dual RL methods by enabling online dual variable guidance for policy updates using off-policy data from a single trajectory. This removes restrictive assumptions like fixed behavioral policies, potentially leading to more practical algorithms. The alignment with optimal rates for two-timescale SA methods under Markovian sampling is a positive aspect, and the use of regularized LP formulations is well-motivated.

major comments (2)
  1. [Convergence Analysis] The almost sure convergence claim depends on the applicability of standard stochastic approximation results (such as Borkar-type theorems for two-timescale systems) to the asynchronous updates with biased gradient estimates from a single correlated trajectory. The manuscript needs to explicitly show that the bias vanishes at a rate compatible with the step-size schedules and that the Markov chain satisfies the required ergodicity or mixing conditions without a fixed behavioral policy. This is load-bearing for the central claim of weaker assumptions.
  2. [Finite-time Guarantee] The strengthened ergodicity assumption introduced for the Õ(k^{-2/3}) mean-square convergence rate should be discussed in relation to the assumptions in existing primal-dual RL literature to confirm that the overall set of assumptions remains weaker. Without this comparison, the improvement in the finite-time result is difficult to evaluate.
minor comments (2)
  1. [Abstract] Clarify the exact meaning of the last-iterate finite-time guarantee and specify the dependence on the iteration index k.
  2. [Notation] Ensure consistent use of symbols for the dual variable and occupancy measure throughout the paper.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of our work's potential impact. We address each major comment below and will revise the manuscript accordingly to improve clarity.

read point-by-point responses
  1. Referee: [Convergence Analysis] The almost sure convergence claim depends on the applicability of standard stochastic approximation results (such as Borkar-type theorems for two-timescale systems) to the asynchronous updates with biased gradient estimates from a single correlated trajectory. The manuscript needs to explicitly show that the bias vanishes at a rate compatible with the step-size schedules and that the Markov chain satisfies the required ergodicity or mixing conditions without a fixed behavioral policy. This is load-bearing for the central claim of weaker assumptions.

    Authors: We appreciate the referee's emphasis on this foundational aspect. The proof of almost-sure convergence (Theorem 1 in Section 4) applies the two-timescale stochastic approximation framework of Borkar (2008) after establishing the requisite conditions in Lemmas 2--4. The bias from experience-replay gradient estimates is controlled by the replay buffer size and decays as O(1/k) under the chosen step-size schedules (α_k ∼ k^{-2/3}, β_k ∼ k^{-1/3}), which satisfies the summability requirements for a.s. convergence. The Markov chain is driven by the slowly varying policy; geometric ergodicity follows from the regularization ensuring a unique stationary distribution with mixing rate independent of any fixed behavior policy (see Assumption 3 and the discussion in Section 3.2). To address the request for explicit verification, we will insert a new subsection 4.1 that tabulates each Borkar condition and confirms its satisfaction, including the precise bias-decay rate. revision: partial

  2. Referee: [Finite-time Guarantee] The strengthened ergodicity assumption introduced for the Õ(k^{-2/3}) mean-square convergence rate should be discussed in relation to the assumptions in existing primal-dual RL literature to confirm that the overall set of assumptions remains weaker. Without this comparison, the improvement in the finite-time result is difficult to evaluate.

    Authors: We agree that an explicit comparison clarifies the contribution. The strengthened ergodicity condition (Assumption 4) is invoked solely for the finite-time last-iterate bound in Theorem 2 and is standard in the Markovian stochastic-approximation literature for attaining the optimal Õ(k^{-2/3}) rate. The almost-sure convergence result holds under the strictly weaker set of assumptions that dispense with both a simulator and a fixed behavioral policy. In the revision we will add a dedicated paragraph in Section 5 and a short comparison table (new Table 1) that contrasts Assumption 4 with the ergodicity conditions appearing in recent primal-dual RL works (e.g., those requiring a fixed behavior policy or i.i.d. sampling), thereby confirming that the core modeling assumptions remain weaker. revision: yes

Circularity Check

0 steps flagged

Convergence analysis invokes external stochastic approximation theorems without internal reduction

full rationale

The paper's almost-sure convergence result for PGDA-RL is derived by applying classical two-timescale stochastic approximation theorems (e.g., Borkar-type results) to the asynchronous primal-dual iterates with experience-replay gradients from a single trajectory. This structure does not reduce the claimed optimality or convergence to any quantity defined by the algorithm's own fitted parameters, self-referential definitions, or renamed empirical patterns. The finite-time bound under a strengthened ergodicity assumption similarly aligns with known rates for Markovian sampling and biased estimates, remaining independent of internal fits. No load-bearing self-citations or ansatzes smuggled via prior author work appear in the provided derivation chain; the analysis is self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard stochastic approximation theory for two-timescale systems and MDP regularity conditions; the main domain assumption is the strengthened ergodicity needed for the explicit rate.

axioms (1)
  • domain assumption Strengthened ergodicity assumption on the underlying Markov chain
    Invoked to obtain the last-iterate finite-time guarantee of Õ(k^{-2/3}) mean-square convergence under Markovian sampling.

pith-pipeline@v0.9.0 · 5760 in / 1321 out tokens · 56396 ms · 2026-05-22T16:16:25.192847+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.