Finite-Time Decoupled Convergence in Nonlinear Two-Time-Scale Stochastic Approximation

Xiang Li; Yuze Han; Zhihua Zhang

arxiv: 2401.03893 · v4 · submitted 2024-01-08 · 🧮 math.OC · stat.ML


Yuze Han, Xiang Li, Zhihua Zhang

Pith reviewed 2026-05-24 04:12 UTC · model grok-4.3


keywords: two-time-scale stochastic approximation · decoupled convergence · nonlinear SA · finite-time analysis · local linearity · mean-square error · step-size selection · cross-term analysis


The pith

Nonlinear two-time-scale stochastic approximation achieves finite-time decoupled convergence under nested local linearity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies whether the decoupled convergence property known for linear two-time-scale stochastic approximation extends to the nonlinear setting. It establishes that, when the updates satisfy a nested local linearity condition, the mean-square errors of the slow and fast iterates converge at rates that depend only on their individual step sizes. The proof proceeds by bounding the cross term between the two iterates and using fourth-order moment convergence to control the extra error terms created by nonlinearity. A counter-example shows that nonlinearity in the slow update alone suffices to break decoupling even when the fast update remains linear.

Core claim

Under the nested local linearity assumption, the finite-time mean-square convergence rates in nonlinear two-time-scale SA decouple: each iterate's error decays at a rate governed solely by its own step size. The result requires choosing the step sizes appropriately and controlling higher-order terms through fourth-order moment bounds on the iterates.
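
To fix notation for the claim above, a generic two-time-scale recursion and a typical form of decoupled rates can be sketched as follows. The symbols $x_t$, $y_t$, $F$, $G$, $\alpha_t$, $\beta_t$, $\xi_t$, $\psi_t$, $x^\star$, $y^\star$ are assumptions introduced here for illustration (this summary does not define them), and the $O(\cdot)$ form mirrors how decoupling is usually stated in the linear literature. It is not a quotation of the paper's theorem.

```latex
\begin{align*}
x_{t+1} &= x_t - \alpha_t\bigl(F(x_t, y_t) + \xi_t\bigr)
  && \text{(slow iterate, step size } \alpha_t\text{)},\\
y_{t+1} &= y_t - \beta_t\bigl(G(x_t, y_t) + \psi_t\bigr)
  && \text{(fast iterate, step size } \beta_t\text{)},
\end{align*}
where $\xi_t, \psi_t$ are noise terms and $\alpha_t / \beta_t \to 0$.
Decoupled convergence then means that each error depends only on its own step size:
\begin{align*}
\mathbb{E}\|x_t - x^\star\|^2 = O(\alpha_t),
\qquad
\mathbb{E}\|y_t - y^\star\|^2 = O(\beta_t).
\end{align*}
```

A coupled bound, by contrast, would let the slow iterate's rate depend on $\beta_t$ as well, for example through a ratio such as $\alpha_t/\beta_t$.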

What carries the argument

Nested local linearity assumption on the nonlinear updates, which permits bounding higher-order error terms via fourth-order moment convergence rates while analyzing the matrix cross term between the slow and fast iterates.

If this is right

Decoupled finite-time rates become available for a class of nonlinear two-time-scale recursions once step sizes are chosen to respect the local linearity scale.
The matrix cross-term analysis plus fourth-order moment control supplies the technical bridge from linear to locally linear nonlinear SA.
Nonlinearity confined to the slow-time-scale update is already enough to destroy decoupling, even if the fast update is exactly linear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Local linearity may serve as a minimal structural condition that preserves decoupling across other multi-scale stochastic algorithms.
The counter-example suggests that verifying local linearity on real data could be a practical test for whether decoupled rates are attainable.
If the assumption holds only in a shrinking neighborhood, the result may still apply after a finite burn-in period once iterates enter that neighborhood.

Load-bearing premise

The updates obey a nested local linearity condition that lets fourth-order moments control the nonlinear remainder terms.
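
One way to see why fourth-order moments are the natural tool here is the following short calculation. It is a sketch under assumptions that are not taken from the paper: $R$ denotes a hypothetical remainder left over after linearizing an update around its root, $\hat{x}$ is the centered iterate, and $S > 0$, $\delta \in (0, 1]$ are hypothetical constants in a bound of the form $\|R(\hat{x})\| \le S\|\hat{x}\|^{1+\delta}$. The second inequality is Lyapunov's inequality for moments.

```latex
\begin{align*}
\mathbb{E}\|R(\hat{x})\|^2
  \;\le\; S^2\,\mathbb{E}\|\hat{x}\|^{2+2\delta}
  \;\le\; S^2\,\bigl(\mathbb{E}\|\hat{x}\|^{4}\bigr)^{\frac{1+\delta}{2}},
  \qquad 0 < \delta \le 1 .
\end{align*}
% Illustration with a hypothetical rate: if E||x_hat_t||^4 = O(alpha_t^2), then
% E||R(x_hat_t)||^2 = O(alpha_t^{1+delta}), which is of higher order than alpha_t.
```

Under these assumptions a second-moment bound alone would not suffice, because the exponent $2 + 2\delta$ exceeds 2; a fourth-moment rate closes the gap for every $\delta$ up to 1.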

What would settle it

A concrete nonlinear two-time-scale example in which the nested local linearity condition is violated yet the mean-square errors still converge at rates depending only on the separate step sizes.

Figures

Figures reproduced from arXiv: 2401.03893 by Xiang Li, Yuze Han, Zhihua Zhang.

**Figure 2.** The convergence results within Example 4.1. We calculate the line slopes using data from the $10^5$ to $10^6$ iteration range. Excerpt from Section 4.1 (An Example without Local Linearity): The central limit theorem in Mokkadem and Pelletier [2006] and our non-asymptotic convergence rates in Section 3 demonstrate that for nonlinear two-time-scale SA, under certain local linearity assumptions, employing step sizes decaying at different …

**Figure 3.** The convergence results for unbiased two-time-scale SA to solve (…

**Figure 4.** The convergence results for the example in (…

**Figure 5.** The convergence results for biased two-time-scale SA to solve (…

**Figure 6.** The convergence results for SGD with Polyak-Ruppert averaging (…

**Figure 7.** The convergence results for simultaneous SHB (…

**Figure 8.** The convergence results for alternating SHB (…
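
The Figure 2 caption mentions line slopes computed over a fixed iteration window. A minimal sketch of how such a slope can be estimated is a least-squares line fit on a log-log scale. The function name `loglog_slope` and its arguments are hypothetical, and this is not the authors' plotting code.

```python
import numpy as np


def loglog_slope(iterations, errors, lo, hi):
    """Least-squares slope of log(error) against log(iteration) on [lo, hi].

    `iterations` and `errors` are 1-D arrays of equal length with positive
    entries. A slope near -1 would correspond to an error decaying like 1/t.
    """
    iterations = np.asarray(iterations, dtype=float)
    errors = np.asarray(errors, dtype=float)
    # Keep only the iterations inside the requested window.
    mask = (iterations >= lo) & (iterations <= hi)
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    slope, _intercept = np.polyfit(np.log(iterations[mask]), np.log(errors[mask]), deg=1)
    return slope


if __name__ == "__main__":
    # Synthetic check on a smaller window than the 10^5 to 10^6 range in the caption.
    t = np.arange(1, 10**4 + 1)
    print(loglog_slope(t, 3.0 / t, lo=10**3, hi=10**4))
```

Restricting the fit to late iterations, as the caption describes, keeps transient behavior from the early iterates out of the estimated rate.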

Original abstract

In two-time-scale stochastic approximation (SA), two iterates are updated at varying speeds using different step sizes, with each update influencing the other. Previous studies on linear two-time-scale SA have shown that the convergence rates of the mean-square errors for these updates depend solely on their respective step sizes, a phenomenon termed decoupled convergence. However, achieving decoupled convergence in nonlinear SA remains less understood. Our research investigates the potential for finite-time decoupled convergence in nonlinear two-time-scale SA. We demonstrate that, under a nested local linearity assumption, finite-time decoupled convergence rates can be achieved with suitable step size selection. To derive this result, we conduct a convergence analysis of the matrix cross term between the iterates and leverage fourth-order moment convergence rates to control the higher-order error terms induced by local linearity. To further investigate the necessity of local linearity for decoupled convergence, we also construct an example showing that, even when the fast-time-scale update is linear, the nonlinearity of the slow-time-scale update alone can destroy decoupled convergence.
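
The setup described in the abstract, two iterates updated with different step sizes while each influences the other, can be sketched on a toy problem. Everything in this example is an assumption chosen for illustration and is not taken from the paper: the scalar linear maps `f` and `g`, the step-size exponents `a` and `b`, the Gaussian noise, and the function name `two_time_scale_sa`.

```python
import numpy as np


def two_time_scale_sa(num_steps, a=1.0, b=2.0 / 3.0, noise_std=1.0, seed=0):
    """Run a toy scalar two-time-scale SA and return both trajectories.

    Slow iterate x uses alpha_t = (t + 1)^(-a); fast iterate y uses
    beta_t = (t + 1)^(-b) with b < a, so alpha_t / beta_t -> 0.
    The unique root of the toy system is (x, y) = (0, 0).
    """
    rng = np.random.default_rng(seed)
    x, y = 1.0, 1.0
    xs = np.empty(num_steps)
    ys = np.empty(num_steps)
    for t in range(num_steps):
        alpha = (t + 1.0) ** (-a)  # slow step size
        beta = (t + 1.0) ** (-b)   # fast step size
        xi, psi = noise_std * rng.standard_normal(2)
        f = x + 0.5 * y            # toy slow-scale map (hypothetical)
        g = y - 0.5 * x            # toy fast-scale map; g = 0 when y = 0.5 * x
        # Both updates use the current (x, y), so each iterate affects the other.
        x, y = x - alpha * (f + xi), y - beta * (g + psi)
        xs[t], ys[t] = x, y
    return xs, ys


if __name__ == "__main__":
    xs, ys = two_time_scale_sa(10_000)
    print(xs[-1], ys[-1])
```

Averaging the squared iterates over many seeds and plotting them against the iteration count on a log-log scale is one way to compare the observed decay of each iterate with its own step size.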

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper extends decoupled finite-time rates from linear to nonlinear two-time-scale SA under a nested local linearity assumption and supplies a counterexample showing slow-scale nonlinearity can break decoupling.


The core result is that nonlinear two-time-scale stochastic approximation can achieve finite-time decoupled convergence rates when the updates satisfy a nested local linearity condition and the step sizes are chosen appropriately. The authors also give an explicit counterexample in which nonlinearity on the slow scale alone destroys the decoupling even if the fast scale stays linear. This moves beyond the linear cases that dominated earlier work.

The technical steps involve analyzing the matrix cross term between the two iterates and using fourth-order moment bounds to handle the higher-order errors from the local-linear approximation. That approach looks reasonable on the surface and directly addresses the gap the authors identify. The counterexample is a useful addition because it shows the assumption is not merely convenient but necessary in some cases. The main soft spot is that the nested local linearity assumption carries a lot of the load; without it the higher-order terms are not controlled, and the paper is upfront about this scoping. Checking the precise moment bounds and cross-term estimates would require going through the full proofs, but nothing in the setup suggests circularity or invented rates.

This work sits squarely in the stochastic approximation literature used for two-time-scale optimization in machine learning. Readers already following finite-time analyses of SA will find the extension and the counterexample worth their time. It is not reshaping the broader field, but it fills a clear technical hole with concrete statements. I would send it to peer review; the claims are scoped and the counterexample adds value, so referees can check the details without the paper overclaiming.

Referee Report

2 major / 1 minor

Summary. The paper claims that under a nested local linearity assumption on the nonlinear updates, finite-time decoupled convergence rates (depending only on the respective step sizes) can be achieved in two-time-scale stochastic approximation via suitable step-size selection. The argument proceeds by analyzing the matrix cross term between the fast and slow iterates and invoking fourth-order moment convergence to bound the higher-order error terms induced by the local-linear approximation. The paper also supplies an explicit counter-example demonstrating that nonlinearity on the slow scale alone suffices to destroy decoupling even when the fast scale is linear.

Significance. If the central claim holds, the result extends the decoupled-convergence phenomenon from linear two-time-scale SA to a nontrivial nonlinear regime while making the scope of the local-linearity assumption explicit via the counter-example. The use of fourth-order moment bounds to close the higher-order terms and the explicit counter-example are clear strengths that clarify the necessity of the assumption.

major comments (2)

[matrix cross-term analysis] The convergence analysis of the matrix cross term (invoked to obtain the decoupled rates) is load-bearing; the manuscript should state explicitly in which section or lemma the cross-term bound is derived under the nested local-linearity assumption and confirm that the resulting rate remains independent of the other time-scale's step size.
[fourth-order moment bounds] The fourth-order moment convergence rates are used to control the remainder terms from the local-linear approximation; the paper should verify that these moment bounds themselves do not introduce hidden dependence on the slow-scale step size, which would undermine the decoupling claim.

minor comments (1)

Notation for the nested local-linearity assumption should be introduced once and used consistently; currently the assumption appears under slightly varying verbal descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment, and recommendation of minor revision. The two major comments concern clarity on load-bearing technical steps; we address them point by point below and will incorporate explicit statements and verifications in the revised manuscript.


Referee: The convergence analysis of the matrix cross term (invoked to obtain the decoupled rates) is load-bearing; the manuscript should state explicitly in which section or lemma the cross-term bound is derived under the nested local-linearity assumption and confirm that the resulting rate remains independent of the other time-scale's step size.

Authors: The matrix cross-term analysis is carried out as part of the convergence argument for the main result, directly under the nested local-linearity assumption. The derivation decomposes the cross term and applies the local-linearity condition to bound the interaction so that its contribution to each mean-square error depends only on the corresponding step-size sequence. We will revise the manuscript to add an explicit pointer to this portion of the argument together with a remark confirming that the resulting rate for each iterate is independent of the other time scale's step size. revision: yes
Referee: The fourth-order moment convergence rates are used to control the remainder terms from the local-linear approximation; the paper should verify that these moment bounds themselves do not introduce hidden dependence on the slow-scale step size, which would undermine the decoupling claim.

Authors: The fourth-order moment bounds are obtained separately for each time scale using only the respective step-size sequences; the nested local-linearity assumption then ensures that the remainder terms arising from the approximation do not create additional coupling that would import dependence on the slow-scale step size into the fast-scale bounds (or vice versa). We will add a short verification remark immediately after the moment-bound statements to make this independence explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under stated assumptions


The paper derives finite-time rates for nonlinear two-time-scale SA from the nested local linearity assumption, cross-term analysis, and fourth-order moment bounds to control approximation errors. It explicitly constructs a counter-example showing slow-scale nonlinearity suffices to break decoupling. No equations reduce by construction to inputs, no parameters are fitted then renamed as predictions, and no load-bearing claims rest on self-citations or imported uniqueness results. The argument is scoped to the assumption and remains falsifiable via the provided counter-example.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the nested local linearity assumption and fourth-order moment convergence rates; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: nested local linearity assumption
Invoked to control higher-order error terms induced by nonlinearity.

pith-pipeline@v0.9.0 · 5703 in / 984 out tokens · 17516 ms · 2026-05-24T04:12:31.878019+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Under the nested local linearity assumption (Assumption 2.5), we derive detailed convergence rates for $\mathbb{E}\|\hat{x}_{t+1}\|^2$, $\mathbb{E}\|\hat{y}_{t+1}\|^2$ and $\|\mathbb{E}(\hat{x}_{t+1}\hat{y}_{t+1}^\top)\|$ … with appropriate step size selection.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: We leverage fourth-order moment convergence rates to control the higher-order error terms induced by local linearity.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

[1] Vivek S Borkar and Sarath Pattathil. Concentration bounds for two time scale stochastic approximation. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 504–511. IEEE, 2018.

[2] Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. Two-timescale Q-learning with function approximation in zero-sum stochastic games. arXiv preprint arXiv:2312.04905.

[3] Thinh T Doan. Finite-time convergence rates of nonlinear two-time-scale stochastic approximation under Markovian noise. arXiv preprint arXiv:2104.01627.

[4] Thinh T Doan. Fast nonlinear two-time-scale stochastic approximation: Achieving O(1/k) finite-sample complexity. arXiv preprint arXiv:2401.12764.

[5] Fathima Zarin Faizal and Vivek Borkar. Functional central limit theorem for two timescale stochastic approximation. arXiv preprint arXiv:2306.05723.

[6] Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246.

[8] Finite-Time Decoupled Convergence in Nonlinear Two-Time-Scale Stochastic Approximation. URL https://arxiv.org/pdf/2401.03893v1. Shaan Ul Haque, Sajad Khodadadian, and Siva Theja Maguluri. Tight finite time bounds of two-time-scale linear stochastic approximation with Markovian noise. arXiv preprint arXiv:2401.00364.

[9] Yue Huang, Zhaoxian Wu, Shiqian Ma, and Qing Ling. Single-timescale multi-sequence stochastic approximation without fixed point smoothness: Theories and applications. arXiv preprint arXiv:2410.13743.

[10] Jeongyeol Kwon, Luke Dotson, Yudong Chen, and Qiaomin Xie. Two-timescale linear stochastic approximation: Constant stepsizes go a long way. arXiv preprint arXiv:2410.13067.

[11] Xiang Li, Jiadong Liang, and Zhihua Zhang. Online statistical inference for nonlinear stochastic approximation with Markovian data. arXiv preprint arXiv:2302.07690, 2023a. Xiang Li, Wenhao Yang, Zhihua Zhang, and Michael I Jordan. A statistical analysis of Polyak-Ruppert averaged Q-learning. In International Conference on Artificial Intelligence and Statis...

[12] Wenlong Mou, Koulik Khamaru, Martin J Wainwright, Peter L Bartlett, and Michael I Jordan. Optimal variance-reduced stochastic approximation in Banach spaces. arXiv preprint arXiv:2201.08518, 2022a. Wenlong Mou, Ashwin Pananjady, and Martin J Wainwright. Optimal oracle inequalities for projected fixed-point equations, with applications to policy evaluatio...

[13] Louis Sharrock. Two-timescale stochastic approximation for bilevel optimisation problems in continuous-time models. arXiv preprint arXiv:2206.06995.

[14] Vladislav B Tadic. Almost sure convergence of two time-scale stochastic approximation algorithms. In Proceedings of the 2004 American Control Conference, volume 4, pages 3802–3807. IEEE, 2004.

[15] Tengyu Xu, Zhe Wang, and Yingbin Liang. Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms. arXiv preprint arXiv:2005.03557.
