Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning

Harsh Gupta; Lei Ying; R. Srikant

arxiv: 1907.06290 · v1 · pith:NKY2LTAUnew · submitted 2019-07-14 · 💻 cs.LG · cs.AI· stat.ML

Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning

Harsh Gupta , R. Srikant , Lei Ying This is my paper

Pith reviewed 2026-05-24 21:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML

keywords two time-scale stochastic approximationfinite-time boundsadaptive learning ratereinforcement learningLyapunov functionsingular perturbationGTDTDC

0 comments

The pith

Finite-time bounds for two time-scale stochastic approximation are derived from a Lyapunov function taken from singular perturbation theory and then used to build an adaptive learning rate rule.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes finite-time performance bounds for two time-scale linear stochastic approximation algorithms with a fixed learning rate. The bounds rest on a Lyapunov function constructed by borrowing the structure of singular perturbation analysis for linear differential equations. These bounds are applied to create an adaptive learning rate schedule. Experiments on algorithms such as GTD, GTD2, and TDC show the adaptive schedule converges faster than the best known polynomial decay rule. A reader would care because the work supplies explicit error guarantees and a practical way to tune rates for algorithms that separate fast and slow updates.

Core claim

Finite-time performance bounds for two time-scale linear stochastic approximation algorithms with fixed learning rates are obtained by lifting a Lyapunov function motivated by singular perturbation theory for linear differential equations to the discrete-time recursion; the resulting bounds are then used to design an adaptive learning rate scheme that changes the rate according to the current error estimates and yields faster convergence than any predetermined polynomial schedule.

What carries the argument

Lyapunov function motivated by singular perturbation theory for linear differential equations, which bounds the coupled fast-slow discrete-time recursion.

If this is right

The adaptive scheme applies to any learning-rate schedule that changes values only at predetermined instants.
The same finite-time bound technique covers the fixed-rate case for any two-time-scale linear stochastic approximation.
Convergence guarantees are supplied for the family of algorithms modeled by these recursions, including GTD, GTD2, and TDC.
The adaptive rule can be combined with existing predetermined schedules without redesigning the underlying analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Lyapunov construction may supply starting points for analyzing two-time-scale methods beyond linear stochastic approximation.
The adaptive schedule could be tested on non-stationary environments where the optimal decay rate is unknown in advance.
The separation of time scales that the bound exploits may also appear in actor-critic or other multi-timescale RL procedures.
Implementation cost of the adaptive rule is low enough that it could be tried on large-scale problems where polynomial schedules are currently used.

Load-bearing premise

The continuous-time Lyapunov function motivated by singular perturbation theory transfers directly to the discrete stochastic recursion with no extra technical gaps.

What would settle it

A numerical counter-example in which the proposed Lyapunov function fails to produce a valid finite-time bound for the two-time-scale recursion, or repeated experiments in which the adaptive schedule shows no improvement over the optimal polynomial decay rule on standard GTD/TDC benchmarks.

read the original abstract

We study two time-scale linear stochastic approximation algorithms, which can be used to model well-known reinforcement learning algorithms such as GTD, GTD2, and TDC. We present finite-time performance bounds for the case where the learning rate is fixed. The key idea in obtaining these bounds is to use a Lyapunov function motivated by singular perturbation theory for linear differential equations. We use the bound to design an adaptive learning rate scheme which significantly improves the convergence rate over the known optimal polynomial decay rule in our experiments, and can be used to potentially improve the performance of any other schedule where the learning rate is changed at pre-determined time instants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Finite-time bounds for two-time-scale linear SA via singular-perturbation Lyapunov, plus an adaptive rate derived from the bound that beats polynomial decay in experiments, though the discrete analysis needs verification.

read the letter

This paper delivers finite-time performance bounds for two-time-scale linear stochastic approximation using a Lyapunov function from singular perturbation theory, and derives an adaptive learning rate from those bounds that beats the known optimal polynomial decay in experiments. The contribution is the application of the singular-perturbation Lyapunov to obtain explicit finite-time results for the fixed-rate case in the two-time-scale setting. This avoids the usual assumption that the fast scale has converged, which is useful for modeling GTD-type RL algorithms. The adaptive scheme is a direct consequence of the bound and shows practical gains. A potential issue is whether the continuous Lyapunov translates rigorously to the discrete stochastic recursion. The one-step drift must retain a negative definite component uniform in the time-scale ratio, with discretization and martingale terms properly bounded. The abstract highlights the continuous motivation without mentioning additional discrete analysis, so that part warrants close inspection in the full paper. The work is grounded in standard stochastic approximation techniques. No circularity appears in the claims. This is aimed at the RL theory community working on finite-time analysis and adaptive methods. Readers focused on two-time-scale algorithms will find the bounds and the adaptive rule relevant. It merits peer review because the idea is fresh and the experimental improvement is concrete. Recommendation: yes, send to referees, but flag the discrete-to-continuous lift for careful checking.

Referee Report

1 major / 2 minor

Summary. The paper studies two time-scale linear stochastic approximation recursions that model algorithms such as GTD, GTD2 and TDC. It derives finite-time performance bounds for fixed learning rates by constructing a Lyapunov function motivated by singular perturbation analysis of the associated linear ODE, then uses the resulting bound to design an adaptive learning-rate rule that is shown in experiments to converge faster than the optimal polynomial-decay schedule.

Significance. If the discrete-time analysis is rigorous, the finite-time bounds supply the first non-asymptotic guarantees for this class of two-time-scale RL methods and the adaptive rule offers a practical improvement over existing schedules. The singular-perturbation Lyapunov construction is a technically interesting device whose correctness would strengthen the contribution.

major comments (1)

[derivation of the finite-time bound (following introduction of the Lyapunov function)] The central finite-time claim rests on lifting the continuous-time Lyapunov function (motivated by singular perturbation theory for linear ODEs) to the discrete two-time-scale recursion. The manuscript must explicitly show that the one-step drift of this Lyapunov inherits a negative-definite term whose constants remain uniform in the time-scale ratio, while all O(α+β) discretization errors and martingale variance terms are controlled; without this verification the finite-time guarantee is not yet rigorous.

minor comments (2)

State the precise assumptions on the noise processes and on the matrices A, B, C (including any eigenvalue conditions required for the singular-perturbation argument) in a single, clearly labeled assumption block.
In the experimental section, report the precise functional form of the adaptive rule, the range of time-scale ratios tested, and whether the comparison schedules were tuned with the same total computational budget.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. The major comment concerns the explicit verification of the one-step drift in the finite-time bound. We address this point below and indicate the planned revision.

read point-by-point responses

Referee: [derivation of the finite-time bound (following introduction of the Lyapunov function)] The central finite-time claim rests on lifting the continuous-time Lyapunov function (motivated by singular perturbation theory for linear ODEs) to the discrete two-time-scale recursion. The manuscript must explicitly show that the one-step drift of this Lyapunov inherits a negative-definite term whose constants remain uniform in the time-scale ratio, while all O(α+β) discretization errors and martingale variance terms are controlled; without this verification the finite-time guarantee is not yet rigorous.

Authors: We agree that an explicit verification of the one-step drift is essential for rigor. In the manuscript the Lyapunov function is constructed from the singular-perturbation analysis of the limiting linear ODE so that its continuous-time derivative is negative definite with a rate independent of the time-scale ratio (under the standard stability assumptions on the fast and slow matrices). The discrete analysis then decomposes the conditional expectation of the one-step change into (i) the continuous drift term, (ii) an O(α+β) discretization error controlled by the Lipschitz constant of the drift vector field, and (iii) a martingale difference whose second-moment is bounded by a constant independent of the ratio. The resulting supermartingale inequality yields the finite-time bound with constants uniform in the ratio. Nevertheless, the current write-up condenses several algebraic steps. We will expand the proof (in the main text or a new appendix) to display each term explicitly, confirm uniformity of the negative-definite constant, and restate the error bounds, thereby addressing the referee’s concern directly. revision: yes

Circularity Check

0 steps flagged

No circularity; bounds derived via external singular-perturbation Lyapunov applied to discrete recursion

full rationale

The paper's central claim is a finite-time bound for fixed learning-rate two-time-scale linear stochastic approximation, obtained by constructing a Lyapunov function motivated by singular perturbation theory for linear differential equations and lifting it to the discrete recursion. This technique is drawn from external continuous-time analysis rather than any self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. The subsequent adaptive learning-rate design is presented as an experimental improvement over known polynomial-decay rules, without the bound itself reducing tautologically to the paper's inputs. No enumerated circularity pattern is exhibited in the abstract or described derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; ledger left empty pending full text.

pith-pipeline@v0.9.0 · 5637 in / 1113 out tokens · 31493 ms · 2026-05-24T21:27:44.509710+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The key idea in obtaining these bounds is to use a Lyapunov function motivated by singular perturbation theory for linear differential equations.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

W(u,z) = d u^T P_u u + (1-d) z^T P_v z … ˙W ≤ −λ̃_min (‖u‖² + ‖z‖²)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The McKay correspondence and local heights for wild-by-tame split metacyclic groups
math.AG 2026-05 unverdicted novelty 7.0

For wild-by-tame split metacyclic groups, the stringy Euler number of the quotient depends on the representation and is not always equal to the number of conjugacy classes, unlike the classical McKay case.