On ergodicity of the SAGA-LD algorithm

Miklós Rásonyi

arXiv: 2604.12815 · v1 · submitted 2026-04-14 · 🧮 math.PR

Pith reviewed 2026-05-10 14:19 UTC · model grok-4.3

classification 🧮 math.PR

keywords SAGA-LD · ergodicity · limiting distribution · law of large numbers · stochastic gradient · sampling algorithm · machine learning

The pith

The SAGA-LD algorithm converges to a limiting distribution with a law of large numbers holding for its time averages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes ergodicity for the SAGA-LD sampling algorithm used in machine learning: the law of the iterates stabilizes over time, which is what makes long-run sampling from high-dimensional targets reliable. The author also proves a law of large numbers, so that empirical averages computed along the algorithm's path converge to the corresponding integrals with respect to the limiting distribution. Standard techniques from Markov chain theory do not apply directly because of the algorithm's gradient memory and stochastic updates, which prompts a specialized, model-specific proof.
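For orientation, one common formulation of SAGA-LD from the variance-reduced Langevin literature can be sketched as follows. This is an assumption on our part: the abstract does not spell out the recursion, and the symbols used here (the potential $U$, its summands $f_i$, the step size $h$, the minibatch $I_t$ of size $n$, the stored points $\alpha_t^i$) are our notation, not necessarily the author's.

```latex
% Assumed setting: target \pi(\theta) \propto \exp(-U(\theta)), with U(\theta) = \sum_{i=1}^{N} f_i(\theta).
\begin{align*}
g_t &= \sum_{i=1}^{N} \nabla f_i(\alpha_t^i)
      + \frac{N}{n} \sum_{i \in I_t} \bigl( \nabla f_i(\theta_t) - \nabla f_i(\alpha_t^i) \bigr), \\
\theta_{t+1} &= \theta_t - h\, g_t + \sqrt{2h}\, \xi_{t+1},
      \qquad \xi_{t+1} \sim \mathcal{N}(0, I_d), \\
\alpha_{t+1}^i &=
  \begin{cases}
    \theta_t, & i \in I_t, \\
    \alpha_t^i, & i \notin I_t.
  \end{cases}
\end{align*}
```

In this sketch the "gradient memory" is the collection of stored points $\alpha_t^i$: the next iterate depends on earlier iterates through them, not only on $\theta_t$.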

Core claim

Using a model-specific method, the author proves that SAGA-LD converges to a limiting distribution and that a law of large numbers holds for the ergodic averages produced by the algorithm.
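In symbols, the two claims can be read roughly as below. The notation is ours: $\theta_t$ denotes the iterate at step $t$, $\mu$ the limiting distribution, and $\varphi$ a test function. The abstract specifies neither the mode of convergence nor the admissible class of $\varphi$, so both are left open here.

```latex
\begin{align*}
\operatorname{Law}(\theta_t) &\longrightarrow \mu
  && \text{as } t \to \infty, \\
\frac{1}{T} \sum_{t=1}^{T} \varphi(\theta_t) &\longrightarrow \int \varphi \, d\mu
  && \text{as } T \to \infty.
\end{align*}
```

Note that $\mu$ is the limit of the algorithm itself; for a fixed step size it need not coincide exactly with the target distribution.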

What carries the argument

The model-specific method developed to prove convergence and the law of large numbers for the intricate dynamics of SAGA-LD.

If this is right

The algorithm's long-run samples follow a well-defined limiting distribution, which is meant to approximate the target distribution.
Ergodic theorems allow consistent estimation of expectations via trajectory averages.
SAGA-LD can be applied in high-dimensional settings with theoretical guarantees on its long-run behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar custom methods could be developed for other non-standard sampling algorithms in machine learning.
The result highlights the need for bespoke analysis when variance reduction techniques complicate the Markov property.
One could test the convergence numerically for specific distributions to check the practical range of the theorem's assumptions.
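A numerical check of this kind could look like the sketch below. Everything in it is an assumption chosen for illustration, not the paper's setting: a one-dimensional Gaussian toy target with potential U(theta) = sum_i (theta - x_i)^2 / 2, a constant step size, minibatches drawn without replacement, and the hypothetical function name `saga_ld`. It runs one long path and compares time averages with the known target mean and variance.

```python
import numpy as np

def saga_ld(data, step, n_iter, batch, theta0, rng):
    """Toy 1-D SAGA-LD path for U(theta) = sum_i (theta - x_i)^2 / 2.

    ASSUMPTION: constant step size, uniform minibatches without replacement,
    memory refreshed for the sampled indices. Not taken from the paper.
    """
    n_data = data.size
    grad_mem = float(theta0) - data      # stored gradients, one per data point
    grad_sum = grad_mem.sum()            # running sum of the stored gradients
    theta = float(theta0)
    path = np.empty(n_iter)
    for t in range(n_iter):
        idx = rng.choice(n_data, size=batch, replace=False)
        new_grads = theta - data[idx]    # fresh gradients at the current iterate
        delta = (new_grads - grad_mem[idx]).sum()
        grad_est = grad_sum + (n_data / batch) * delta   # SAGA gradient estimate
        grad_sum += delta                # refresh the memory for sampled indices
        grad_mem[idx] = new_grads
        theta = theta - step * grad_est + np.sqrt(2.0 * step) * rng.standard_normal()
        path[t] = theta
    return path

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=100)   # synthetic observations
path = saga_ld(data, step=1e-3, n_iter=20_000, batch=10, theta0=5.0, rng=rng)

burn_in = 1_000                                   # discard the initial transient
ergodic_mean = path[burn_in:].mean()
ergodic_var = path[burn_in:].var()
print("ergodic mean:", ergodic_mean, "| target mean:", data.mean())
print("ergodic var: ", ergodic_var, "| target var: ", 1.0 / data.size)
```

For this toy potential the target is Gaussian with mean equal to the sample mean of the data and variance 1/N, so the time averages have a known reference value. With a constant step the variance carries a small discretization bias, so agreement should be expected only approximately.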

Load-bearing premise

The model-specific proof requires particular conditions on the target distribution, step sizes, and memory parameters that remain unspecified in the abstract.

What would settle it

A concrete counterexample consisting of a target distribution and algorithm parameters where SAGA-LD does not converge to a unique stationary distribution would falsify the result.
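A search for such a counterexample could start from a probe like the one sketched here: run many independent replicas from two different initial points and compare the empirical laws at a fixed time with a two-sample Kolmogorov-Smirnov distance. The setup is assumed for illustration only (one-dimensional Gaussian toy target, constant step, batch size one, hypothetical names `saga_ld_replicas` and `ks_distance`). Agreement of the two laws is consistent with a unique limit but proves nothing; persistent disagreement at large times would be the interesting signal.

```python
import numpy as np

def saga_ld_replicas(data, step, n_iter, theta0, n_rep, rng):
    """State after n_iter steps of n_rep independent toy SAGA-LD chains.

    ASSUMPTION: U(theta) = sum_i (theta - x_i)^2 / 2, constant step,
    one uniformly sampled index per step. Not taken from the paper.
    """
    n_data = data.size
    theta = np.full(n_rep, float(theta0))
    grad_mem = theta[:, None] - data[None, :]   # stored gradients, shape (n_rep, n_data)
    grad_sum = grad_mem.sum(axis=1)
    rows = np.arange(n_rep)
    for _ in range(n_iter):
        i = rng.integers(0, n_data, size=n_rep)     # one index per replica
        new_grad = theta - data[i]
        delta = new_grad - grad_mem[rows, i]
        grad_est = grad_sum + n_data * delta        # SAGA estimate with batch size 1
        grad_sum += delta
        grad_mem[rows, i] = new_grad
        theta = theta - step * grad_est + np.sqrt(2.0 * step) * rng.standard_normal(n_rep)
    return theta

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: sup |F_a - F_b|."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / a.size
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(1)
data = rng.normal(size=20)                          # synthetic observations
kwargs = dict(step=5e-3, n_rep=2000, rng=rng)

# Laws after few steps (still dominated by the starting point) and after many.
ks_early = ks_distance(saga_ld_replicas(data, n_iter=5, theta0=-3.0, **kwargs),
                       saga_ld_replicas(data, n_iter=5, theta0=3.0, **kwargs))
ks_late = ks_distance(saga_ld_replicas(data, n_iter=500, theta0=-3.0, **kwargs),
                      saga_ld_replicas(data, n_iter=500, theta0=3.0, **kwargs))
print("KS distance after 5 steps:  ", ks_early)
print("KS distance after 500 steps:", ks_late)
```

To hunt for an actual counterexample one would replace the quadratic potential by a harder target (multimodal, heavy-tailed) and vary the step size; the Gaussian case is only the baseline where convergence is expected.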

Original abstract

The so-called SAGA-LD algorithm is used for efficient sampling from high-dimensional distributions in machine learning. Its intricate dynamics resists standard approaches of Markov chain theory. We prove, using a model-specific method, that SAGA-LD converges to a limiting distribution and a law of large numbers holds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Rásonyi gives a model-specific proof that SAGA-LD is ergodic and obeys a law of large numbers, but the abstract leaves the needed conditions on the target, step sizes, and memory completely unstated.

The paper's core contribution is a direct proof that the SAGA-LD sampler converges to a stationary distribution and that time averages converge, using an argument built around the algorithm's particular structure instead of generic Markov-chain machinery. That is new for this sampler, which is already run in high-dimensional ML settings, so a clean justification would be useful to people who want to trust long-run statistics from it. The author is careful to note that standard tools do not apply easily, which is fair given the memory and gradient noise terms. If the argument is correct and the conditions are not too restrictive, it supplies the kind of backing that practitioners sometimes ask for.

The main weakness is exactly the one flagged in the stress test: no theorem statement appears with explicit hypotheses on the potential function, the step-size sequence, or the finite memory length. Without those, it is impossible to know whether the result covers the non-convex or only weakly convex targets that dominate modern sampling work, or whether it quietly requires strong convexity and constant steps that would narrow its scope. The abstract is too short to judge the technical details or the citation placement against earlier Langevin and SAGA analyses.

This is the sort of paper that belongs in a probability journal that handles applied stochastic algorithms. Readers who work on sampling methods or who need to decide whether to run SAGA-LD for long simulations would get something concrete from it, provided the assumptions turn out reasonable. It is worth sending to referees so they can check the proof steps and the precise conditions; the abstract alone is too thin to support a decision.

Referee Report

1 major / 0 minor

Summary. The manuscript proves that the SAGA-LD algorithm converges to a limiting distribution and satisfies a law of large numbers, using a model-specific method to handle its intricate dynamics that resist standard Markov chain theory.

Significance. A rigorous proof of ergodicity for SAGA-LD under conditions relevant to high-dimensional non-convex sampling would be significant for machine learning, as it would justify the algorithm's use with theoretical guarantees on convergence and averaging.

major comments (1)

The central claim rests on a model-specific proof whose assumptions on the target distribution (e.g., convexity or smoothness requirements), step-size schedule, and memory parameters are not stated explicitly enough to verify applicability to typical ML settings; without a clear theorem statement listing these conditions, the scope of the result cannot be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for highlighting the need for greater clarity on the assumptions. We address the major comment below.

Referee: The central claim rests on a model-specific proof whose assumptions on the target distribution (e.g., convexity or smoothness requirements), step-size schedule, and memory parameters are not stated explicitly enough to verify applicability to typical ML settings; without a clear theorem statement listing these conditions, the scope of the result cannot be assessed.

Authors: We agree that a single, self-contained theorem statement listing all assumptions would improve readability and allow readers to readily assess applicability. In the revised manuscript we will add an explicit theorem statement (placed prominently at the start of the main results section) that enumerates every condition required for the convergence and law-of-large-numbers results. This statement will include the precise requirements on the target distribution (any smoothness, convexity, or other regularity assumptions used in the proof), the step-size schedule, and the memory parameters of the SAGA-LD recursion. The model-specific character of the argument will be retained, as it is required to handle the non-standard dynamics that fall outside conventional Markov-chain frameworks, but the conditions themselves will be stated upfront and unambiguously.

revision: yes

Circularity Check

0 steps flagged

No circularity: direct proof of ergodicity via model-specific method with no self-referential reductions.

The paper claims a mathematical proof that SAGA-LD converges to a limiting distribution and satisfies a law of large numbers, using a model-specific method. No equations, parameters, or steps in the provided abstract or description reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations. The derivation is presented as an independent argument rather than a renaming or tautological prediction, making the result self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the proof is described only at the level of 'model-specific method' with no further breakdown possible.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] BHATTACHARYA, R. and MAJUMDAR, M. (1999). On a theorem of Dubins and Freedman. J. Theor. Probab., 12, 1067–1087.

[2] BHATTACHARYA, R. N. and WAYMIRE, E. C. (2002). An approach to the existence of unique invariant probabilities for Markov processes. In: Limit Theorems in Probability and Statistics, János Bolyai Math. Soc., I (Balatonlelle 1999), 181–200.

[3] BHATTACHARYA, R. N. and WAYMIRE, E. C. (2009). Stochastic Processes with Applications. SIAM, Philadelphia.

[4] CARASSUS, L. and RÁSONYI, M. (2015). On Optimal Investment for a Behavioural Investor in Multiperiod Incomplete Market Models. Math. Finance, 25, 115–153.

[5] CHATTERJI, N., FLAMMARION, N., MA, Y., BARTLETT, B. and JORDAN, M. (2018). On the theory of variance reduction for stochastic gradient Monte Carlo. In: International Conference on Machine Learning, PMLR, 764–773.

[6] DEFAZIO, A., BACH, F. and LACOSTE-JULIEN, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27.

[7] GERENCSÉR, B. and RÁSONYI, M. (2022). Invariant measures for multidimensional fractional stochastic volatility models. Stochastics and PDEs, 10, 1132–1164.

[8] HANSEN, B. (2019). A weak law of large numbers under weak mixing. Preprint. https://users.ssc.wisc.edu/~bhansen/papers/wlln.pdf

[9] LOVAS, A. (2025). Transition of α-mixing in random iterations with applications in queuing theory. Stochastic Processes and their Applications, 104803.

[10] MEYN, S. P. and TWEEDIE, R. L. (1993). Markov chains and stochastic stability. Springer-Verlag.

[11] WELLING, M. and TEH, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 681–688.

[12] ZOU, D., XU, P. and GU, Q. (2019). Sampling from non-log-concave distributions via variance-reduced gradient Langevin dynamics. In: 22nd International Conference on Artificial Intelligence and Statistics, PMLR, 2936–2945.
