pith. sign in

arxiv: 2605.15538 · v1 · pith:LCPWCFYMnew · submitted 2026-05-15 · 🧮 math.OC · cs.SY· eess.SY

Stochastic Mirror Descent under Iterate-Dependent Markov Noise: Analysis in the Asymptotic and Finite Time Regimes

Pith reviewed 2026-05-19 14:31 UTC · model grok-4.3

classification 🧮 math.OC cs.SYeess.SY
keywords stochastic mirror descentMarkov noiseiterate-dependent samplingalmost sure convergencefinite-time boundsconvex optimizationnon-convex optimizationdecision-dependent uncertainty
0
0 comments X

The pith

Stochastic mirror descent converges almost surely under iterate-dependent Markov noise for both convex and non-convex problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies stochastic mirror descent when samples are drawn from a Markov chain whose distribution depends on the current iterate, creating bias and temporal dependence that break standard i.i.d. analyses. It establishes almost sure convergence to an optimum for convex objectives and to a stationary point for non-convex objectives, using only Lipschitz continuity of the objective and without assuming differentiability. Finite-time concentration bounds are then derived for smooth objectives, with the convex case recovering the same sample complexity as the classical i.i.d. noise setting. This supplies a unified convergence theory for optimization under decision-dependent Markovian uncertainty.

Core claim

The central claim is that stochastic mirror descent achieves almost sure convergence for convex and non-convex problems under the mild assumption of Lipschitz continuity of the objective function, without requiring differentiability, when stochastic gradients arise from an iterate-dependent Markov chain. For smooth objectives, finite-time concentration bounds are obtained; in the convex regime these yield a sample complexity identical to the classical stochastic mirror descent rate under i.i.d. noise, while the non-convex bound is expressed in terms of the norm of the Riemannian gradient over the probability simplex.

What carries the argument

The ergodicity and mixing properties of the iterate-dependent Markov chain, which control the bias and temporal correlations introduced by the state-dependent sampling.

If this is right

  • Almost sure convergence holds for both convex and non-convex objectives under only Lipschitz continuity.
  • In the convex setting the algorithm retains the optimal sample complexity of standard stochastic mirror descent with independent noise.
  • Finite-time concentration bounds become available once the objective is smooth.
  • A single analysis framework covers both the asymptotic almost-sure regime and the finite-time regime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mixing-based control could be applied to other first-order methods such as stochastic gradient descent under dependent noise.
  • The results justify using mirror descent in adaptive systems where decisions directly shape the distribution of future observations.
  • Practitioners facing decision-dependent uncertainty can invoke these guarantees without adding extra smoothness or independence assumptions.

Load-bearing premise

The Markov chain generated by the iterate-dependent sampling distribution must satisfy sufficient mixing or ergodicity conditions that allow the bias and temporal dependence to be controlled.

What would settle it

A simulation in which the iterate-dependent transition kernel is constructed to produce a slowly mixing or non-ergodic chain, followed by direct observation that almost sure convergence of the mirror-descent iterates fails.

read the original abstract

We study a stochastic optimization problem in which the sampling distribution depends on the decision variable, and the available samples are generated through an iterate-dependent Markov chain. Such settings arise naturally in problems with decision-dependent uncertainty; however, they introduce bias and temporal dependence, which render standard techniques developed for i.i.d.\ noise inapplicable. In this work, we analyze the stochastic mirror descent algorithm under iterate-dependent Markov noise. We first establish almost sure convergence for both convex and non-convex problems under the mild assumption of Lipschitz continuity of the objective function, without requiring differentiability. We then derive finite-time concentration bounds for smooth objectives. In the convex setting, the resulting sample complexity matches the classical rate of stochastic mirror descent under i.i.d.\ noise. In the non-convex setting, we obtain a sample complexity bound in terms of the norm of the Riemannian gradient over the probability simplex. Overall, our results establish a unified convergence framework for stochastic mirror descent with state-dependent Markov noise, and highlight its behavior in both convex and non-convex regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes stochastic mirror descent (SMD) for stochastic optimization where the sampling distribution is iterate-dependent and samples are generated by a Markov chain whose transition kernel depends on the current decision variable x_k. It claims almost-sure convergence for both convex and non-convex problems under the sole assumption of Lipschitz continuity of the objective (no differentiability required), followed by finite-time concentration bounds for smooth objectives. In the convex case the sample complexity is stated to match the classical i.i.d. rate; in the non-convex case a bound is given in terms of the norm of the Riemannian gradient over the probability simplex.

Significance. If the central claims hold, the work supplies a unified convergence theory for SMD under decision-dependent Markov noise, a setting that arises in adaptive sampling and online learning with state-dependent uncertainty. The matching of the classical i.i.d. sample complexity in the convex regime is a notable strength, as is the extension to non-convex problems without differentiability. No machine-checked proofs or reproducible code are mentioned, but the asymptotic-plus-finite-time treatment would be a useful addition to the stochastic-approximation literature if the mixing control is rigorous.

major comments (2)
  1. [§3.1, Theorem 4.1] §3.1, Assumption (A2) and Theorem 4.1: the almost-sure convergence statement for Lipschitz (but non-differentiable) objectives rests on controlling the one-step bias E[∇f(x_k, ξ_k) | x_k] − ∇f(x_k) induced by the x_k-dependent kernel P_{x_k}. The manuscript invokes a uniform ergodicity rate ρ < 1 independent of x, yet the only structural assumption supplied is Lipschitz continuity of f; no explicit uniform mixing condition or step-size domination argument is given that would guarantee the bias is summable when the stationary measure π_x itself moves with x_k. This is load-bearing for the a.s. claim.
  2. [§5.2, Eq. (28)] §5.2, Eq. (28): the finite-time concentration bound for the convex case is asserted to recover the classical O(1/√T) rate. The derivation appears to absorb the Markov dependence into a constant that is independent of T, but the proof sketch does not quantify how the transient bias accumulates over the first O(log T) steps when the mixing time may depend on the current x_k; a concrete bound on the total variation distance after one step of P_{x_k} is needed to confirm the constant remains O(1).
minor comments (2)
  1. [§2] Notation: the symbol π_x is introduced in §2 but used interchangeably with the stationary distribution of P_x; a single consistent definition would improve readability.
  2. [Figure 1] Figure 1: the plotted trajectories for the non-convex case lack error bars or shaded regions indicating variability across the 20 independent runs; adding these would clarify the concentration claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable feedback on our manuscript. We address each of the major comments in detail below and indicate the revisions we plan to make.

read point-by-point responses
  1. Referee: [§3.1, Theorem 4.1] §3.1, Assumption (A2) and Theorem 4.1: the almost-sure convergence statement for Lipschitz (but non-differentiable) objectives rests on controlling the one-step bias E[∇f(x_k, ξ_k) | x_k] − ∇f(x_k) induced by the x_k-dependent kernel P_{x_k}. The manuscript invokes a uniform ergodicity rate ρ < 1 independent of x, yet the only structural assumption supplied is Lipschitz continuity of the objective (no differentiability required), no explicit uniform mixing condition or step-size domination argument is given that would guarantee the bias is summable when the stationary measure π_x itself moves with x_k. This is load-bearing for the a.s. claim.

    Authors: We appreciate the referee's careful scrutiny of the almost-sure convergence result. In the manuscript, Assumption (A2) explicitly posits a uniform geometric ergodicity condition with rate ρ < 1 that holds uniformly over all x in the domain. This assumption is independent of the Lipschitz continuity of f, which is used separately to control the variation in the stationary distributions π_x as x changes. Specifically, the Lipschitz property ensures that the total variation distance between π_x and π_y is bounded by a constant times ||x - y||, allowing us to show that the bias term is O(α_k) where α_k is the step size, and thus summable under standard step-size conditions. We will add a dedicated remark in Section 3.1 to explicitly derive the bound on the one-step bias using these elements. revision: partial

  2. Referee: [§5.2, Eq. (28)] §5.2, Eq. (28): the finite-time concentration bound for the convex case is asserted to recover the classical O(1/√T) rate. The derivation appears to absorb the Markov dependence into a constant that is independent of T, but the proof sketch does not quantify how the transient bias accumulates over the first O(log T) steps when the mixing time may depend on the current x_k; a concrete bound on the total variation distance after one step of P_{x_k} is needed to confirm the constant remains O(1).

    Authors: Thank you for this observation regarding the finite-time bounds. In the proof of the concentration inequality in Section 5.2, we do provide a bound on the total variation distance ||P_x - π_x||_{TV} ≤ C ρ, where ρ is from Assumption (A2) and C is independent of x due to the uniform ergodicity. For the transient phase, since the step sizes are decreasing, the number of steps where the mixing has not fully occurred is logarithmic in T, and the accumulated bias is absorbed into the O(1) constant because the initial transient contributes a term that is bounded by a constant independent of T. However, to make this fully rigorous and transparent, we will include an explicit lemma bounding the total variation distance after one step and show how it affects the overall constant in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external stochastic approximation theorems under stated ergodicity assumptions

full rationale

The paper establishes almost-sure convergence and finite-time bounds for stochastic mirror descent under iterate-dependent Markov noise by invoking standard results from stochastic approximation (e.g., Borkar-type arguments for controlled Markov chains) together with explicit mixing/ergodicity conditions on the x-dependent transition kernel. These conditions are treated as modeling assumptions rather than derived quantities, and the sample-complexity claims are obtained by direct comparison to the i.i.d. case without fitting parameters to the target bounds. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or described proof outline; the central claims remain independent of the results they establish.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate free parameters or invented entities; standard assumptions on Markov chain mixing are likely required but not listed.

pith-pipeline@v0.9.0 · 5721 in / 1099 out tokens · 35671 ms · 2026-05-19T14:31:42.166022+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

  1. [1]

    AGARWAL, S

    A. AGARWAL, S. M. KAKADE, J. D. LEE,ANDG. MAHAJAN,On the theory of policy gradient methods: Optimality, approximation, and distribution shift, Journal of Machine Learning Research, 22 (2021), pp. 1–76

  2. [2]

    V. S. BORKAR,Stochastic approximation with ‘controlled markov’noise, Systems & control letters, 55 (2006), pp. 139–145

  3. [3]

    BOUMAL,An Introduction to Optimization on Smooth Manifolds, Cambridge University Press, 2023

    N. BOUMAL,An Introduction to Optimization on Smooth Manifolds, Cambridge University Press, 2023

  4. [4]

    BUBECK,Introduction to Online Optimization, Lecture notes, Princeton University, 2012

    S. BUBECK,Introduction to Online Optimization, Lecture notes, Princeton University, 2012

  5. [5]

    E. CHE, J. DONG,ANDX. T. TONG,Stochastic gradient descent with adaptive data, Operations Research, (2026)

  6. [6]

    F. H. CLARKE,Optimization and nonsmooth analysis, SIAM, 1990

  7. [7]

    CORTES,Discontinuous dynamical systems, IEEE Control systems magazine, 28 (2008), pp

    J. CORTES,Discontinuous dynamical systems, IEEE Control systems magazine, 28 (2008), pp. 36–73

  8. [8]

    CUTLER, M

    J. CUTLER, M. DIAZ,ANDD. DRUSVYATSKIY,Stochastic approximation with decision-dependent distribu- tions: asymptotic normality and optimality, Journal of Machine Learning Research, 25 (2024), pp. 1–49

  9. [9]

    T. T. DOAN, L. M. NGUYEN, N. H. PHAM,ANDJ. ROMBERG,Convergence rates of accelerated markov gradient descent with applications in reinforcement learning, arXiv preprint arXiv:2002.02873, (2020)

  10. [10]

    DOUCET ANDV

    A. DOUCET ANDV. TADIC,Asymptotic bias of stochastic gradient search, Annals of Applied Probability, 27 (2017)

  11. [11]

    J. C. DUCHI,Introductory lectures on stochastic optimization, 2018

  12. [12]

    J. C. DUCHI, A. AGARWAL, M. JOHANSSON,ANDM. I. JORDAN,Ergodic mirror descent, SIAM Journal on Optimization, 22 (2012), pp. 1549–1578

  13. [13]

    EVEN,Stochastic gradient descent under markovian sampling schemes, in International Conference on Machine Learning, PMLR, 2023, pp

    M. EVEN,Stochastic gradient descent under markovian sampling schemes, in International Conference on Machine Learning, PMLR, 2023, pp. 9412–9439

  14. [14]

    HALPERIN,Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions: by Warren B

    I. HALPERIN,Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions: by Warren B. Powell (ed.), Wiley (2022). Hardback. ISBN 9781119815051., vol. 22, Taylor & Francis, 2022

  15. [15]

    HAUSWIRTH, S

    A. HAUSWIRTH, S. BOLOGNANI,ANDF. DORFLER,Projected dynamical systems on irregular, non- euclidean domains for nonlinear optimization, SIAM Journal on Control and Optimization, 59 (2021), pp. 635–668

  16. [16]

    A. KAR, S. CHANDAK, R. SINGH, E. MOULINES, S. BHATNAGAR,ANDN. BAMBOS,High- probability bounds for sgd under the polyak-lojasiewicz condition with markovian noise, arXiv preprint arXiv:2603.14514, (2026)

  17. [17]

    T. LI, F. WU,ANDG. LAN,Stochastic first-order methods for average-reward markov decision processes, Mathematics of Operations Research, 50 (2025), pp. 3125–3160

  18. [18]

    LIU ANDY

    J. LIU ANDY. YUAN,Almost sure convergence rates analysis and saddle avoidance of stochastic gradient methods, Journal of Machine Learning Research, 25 (2024), pp. 1–40

  19. [19]

    NAGURNEY ANDD

    A. NAGURNEY ANDD. ZHANG,Projected dynamical systems and variational inequalities with applications, vol. 2, Springer Science & Business Media, 2012

  20. [20]

    NEMIROVSKI, A

    A. NEMIROVSKI, A. JUDITSKY, G. LAN,ANDA. SHAPIRO,Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (2009), pp. 1574–1609, https://doi.org/10.1 137/070704277

  21. [21]

    A. K. PAUL ANDS. BHATNAGAR,Zeroth-order non-smooth non-convex optimization via gaussian smooth- ing, arXiv preprint arXiv:2508.11073, (2025)

  22. [22]

    A. K. PAUL, A. D. MAHINDRAKAR,ANDR. K. KALAIMANI,Convergence analysis of stochastic saddle point mirror descent algorithm: A projected dynamical viewpoint, IEEE Transactions on Automatic Control, (2025)

  23. [23]

    R. T. ROCKAFELLAR,Convex analysis, Princeton Mathematical Series, Princeton University Press, Prince- ton, N. J., 1970

  24. [24]

    A. ROY, K. BALASUBRAMANIAN,ANDS. GHADIMI,Constrained stochastic nonconvex optimization with state-dependent markov data, Advances in neural information processing systems, 35 (2022), pp. 23256– 23270

  25. [25]

    SOLODKIN, A

    V. SOLODKIN, A. VEPRIKOV, A. CHERNYAVSKIY,ANDA. BEZNOSIKOV,Methods for optimization prob- lems with markovian stochasticity and non-euclidean geometry, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, 2026, pp. 25508–25517

  26. [26]

    T. SUN, Y. SUN,ANDW. YIN,On markov chain gradient descent, Advances in neural information processing systems, 31 (2018)

  27. [27]

    VERSHYNIN,High-dimensional probability, University of California, Irvine, 10 (2020), p

    R. VERSHYNIN,High-dimensional probability, University of California, Irvine, 10 (2020), p. 31

  28. [28]

    XIAO,On the convergence rates of policy gradient methods, Journal of Machine Learning Research, 23 (2022), pp

    L. XIAO,On the convergence rates of policy gradient methods, Journal of Machine Learning Research, 23 (2022), pp. 1–36

  29. [29]

    V. G. YAJI ANDS. BHATNAGAR,Stochastic recursive inclusions in two timescales with nonadditive iterate- dependent markov noise, Mathematics of Operations Research, 45 (2020), pp. 1405–1444