Recognition: no theorem link
Convergence of Riemannian Stochastic Gradient Descents: Varying Batch Sizes And Nonstandard Batch Forming
Pith reviewed 2026-05-10 18:35 UTC · model grok-4.3
The pith
Riemannian stochastic gradient descent still converges when the underlying probability space is allowed to vary from iteration to iteration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish convergence theorems for Riemannian stochastic gradient descents in which the underlying probability spaces vary from iteration to iteration. As applications, we deduce convergence results for Riemannian stochastic gradient descents with varying batch sizes and unbiased batch forming schemes.
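For orientation, a hedged rendering of the update such theorems govern, in notation the abstract does not itself fix (the symbols μ_k, α_k, G, and the use of the exponential map are our assumptions):

```latex
% Assumed iteration: draw xi_k from the iteration-dependent measure mu_k,
% then step along the manifold via the exponential map (or a retraction).
\[
  x_{k+1} = \exp_{x_k}\!\bigl(-\alpha_k\, G(x_k, \xi_k)\bigr),
  \qquad \xi_k \sim \mu_k,
\]
\[
  \mathbb{E}_{\mu_k}\!\bigl[\, G(x_k, \xi_k) \mid \mathcal{F}_k \,\bigr]
  = \operatorname{grad} f(x_k) \in T_{x_k}M.
\]
```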
What carries the argument
The iteration-dependent sequence of probability measures, which must deliver unbiased gradient estimates with controlled variance or suitable Lipschitz properties on the manifold.
Load-bearing premise
That the sequence of probability measures provides unbiased gradient estimates with bounded variance or satisfies appropriate Lipschitz conditions at each iteration.
What would settle it
A concrete counterexample on a manifold where unbiased but varying batch sizes cause the iterates to diverge despite satisfying all other stated conditions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript establishes general convergence theorems for Riemannian stochastic gradient descent under a sequence of iteration-dependent probability measures μ_k that are assumed to yield unbiased stochastic gradients and controlled variance in the tangent spaces. It then deduces convergence for the special cases of varying batch sizes and nonstandard unbiased batch-forming schemes by claiming these schemes satisfy the general hypotheses.
Significance. If the general theorems are correctly proved under the stated assumptions on the measures μ_k, the work would supply a useful abstract framework for handling non-stationary sampling in Riemannian optimization. This could facilitate analysis of practical variants such as adaptive batching on manifolds. No machine-checked proofs or parameter-free derivations are present, so the contribution rests entirely on the analytic arguments.
major comments (1)
- [§4] Applications to varying batch sizes and nonstandard batch forming: the deduction that these schemes satisfy the unbiasedness condition E_{μ_k}[G(x_k, ξ_k)] = grad f(x_k) and the second-moment bound in T_{x_k}M is asserted without explicit verification or calculation (the display below spells out the i.i.d. case). Because the Riemannian setting requires all expectations to be taken in the tangent space at the current point (with parallel transport where cross-iteration comparisons are needed), any failure of unbiasedness under non-i.i.d. batch rules would invalidate the supermartingale argument used for convergence. This verification is load-bearing for the claimed applications.
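To make the requested verification concrete, here is the calculation the report asks for in the simplest case, an i.i.d. batch of varying size; the batch size b_k, per-sample gradients grad f_ξ, and per-sample variance σ²(x_k) are our notation, not the manuscript's.

```latex
% Averaging b_k i.i.d. per-sample gradients inside T_{x_k}M; unbiasedness
% follows from linearity of expectation, for any batch-size schedule b_k.
\[
  G(x_k) = \frac{1}{b_k}\sum_{i=1}^{b_k} \operatorname{grad} f_{\xi_{k,i}}(x_k),
  \qquad
  \mathbb{E}\bigl[G(x_k) \mid \mathcal{F}_k\bigr]
  = \frac{1}{b_k}\sum_{i=1}^{b_k}
    \mathbb{E}\bigl[\operatorname{grad} f_{\xi_{k,i}}(x_k) \mid \mathcal{F}_k\bigr]
  = \operatorname{grad} f(x_k),
\]
\[
  \mathbb{E}\bigl[\|G(x_k) - \operatorname{grad} f(x_k)\|_{x_k}^{2} \mid \mathcal{F}_k\bigr]
  = \frac{\sigma^{2}(x_k)}{b_k}
  \quad \text{(using independence within the batch).}
\]
% For non-i.i.d. batch-forming rules, the last equality in the first
% display is exactly what must be re-derived from the definition of mu_k.
```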
minor comments (2)
- The statement of the general assumptions on the sequence {μ_k} (e.g., summability of variance terms) could be collected in a single numbered list for easier reference when checking the applications.
- Notation for the stochastic gradient G and its norm in the tangent space is occasionally inconsistent between the general theorems and the batch-size examples; a uniform symbol or explicit reminder about parallel transport would improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address the single major comment below and describe the revisions we will make.
Point-by-point responses
- Referee: [§4] Applications to varying batch sizes and nonstandard batch forming: the deduction that these schemes satisfy the unbiasedness condition E_{μ_k}[G(x_k, ξ_k)] = grad f(x_k) and the second-moment bound in T_{x_k}M is asserted without explicit verification or calculation. Because the Riemannian setting requires all expectations to be taken in the tangent space at the current point (with parallel transport where cross-iteration comparisons are needed), any failure of unbiasedness under non-i.i.d. batch rules would invalidate the supermartingale argument used for convergence. This verification is load-bearing for the claimed applications.
  Authors: We agree that the manuscript asserts without explicit calculation that the varying-batch-size and nonstandard-batch-forming schemes satisfy the unbiasedness and second-moment hypotheses of the general theorems. In the revised version we will add the missing verifications in §4. For varying batch sizes we will show directly that the batch average G is an unbiased estimator of grad f(x_k) in T_{x_k}M, because each summand is drawn independently from the same data distribution; the second-moment bound then follows from the per-sample variance assumption. For the nonstandard batch-forming schemes we will compute E_{μ_k}[G] explicitly from the definition of μ_k and confirm it equals grad f(x_k), together with the corresponding second-moment control. These calculations will be performed entirely within the tangent space at the current point x_k, so no parallel transport between iterations is required for the per-step conditions. The added details will make the application of the supermartingale argument fully rigorous.
  Revision: yes
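A minimal numerical sanity check of the kind of verification the authors promise, assuming the unit sphere S² with f(x) = −⟨a, x⟩ (so the Riemannian gradient is the tangential projection of −a) and i.i.d. Gaussian per-sample noise; every name and parameter below is illustrative, not taken from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)

def proj_tangent(x, v):
    """Project v onto the tangent space T_x S^2 = {w : <w, x> = 0}."""
    return v - np.dot(v, x) * x

def riem_grad(x, a):
    """Riemannian gradient of f(x) = -<a, x> on the unit sphere."""
    return proj_tangent(x, -a)

def noisy_batch_grad(x, a, batch_size):
    """Batch average of per-sample estimates: the exact gradient plus
    zero-mean Gaussian noise projected into T_x S^2, so each estimate
    stays tangent and unbiased by construction."""
    samples = [riem_grad(x, a) + proj_tangent(x, rng.normal(size=3))
               for _ in range(batch_size)]
    return np.mean(samples, axis=0)

a = np.array([1.0, 2.0, -0.5])
x = np.array([0.0, 0.0, 1.0])   # base point on S^2
exact = riem_grad(x, a)

# Varying batch sizes b_k: the empirical mean of the estimator should
# match grad f(x), and its total variance should shrink like 1/b_k (the
# projected noise has total variance 2, the tangent-space dimension).
for b in [1, 4, 16]:
    est = np.array([noisy_batch_grad(x, a, b) for _ in range(20000)])
    bias = np.linalg.norm(est.mean(axis=0) - exact)
    var = est.var(axis=0).sum()
    print(f"b={b:2d}  bias={bias:.4f}  total var={var:.4f}  (expect ~{2/b:.4f})")
```

Projecting the noise keeps every per-sample estimate in T_x S², matching the authors' point that the per-step conditions live entirely in the tangent space at the current iterate.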
Circularity Check
Direct mathematical convergence proofs without self-referential reductions
full rationale
The paper establishes general convergence theorems for Riemannian SGD where the probability measures vary per iteration, under assumptions that ensure unbiased gradients and controlled variance. The applications to varying batch sizes and nonstandard unbiased batch forming are direct consequences of these theorems, as the schemes are stated to satisfy the required unbiasedness. No step reduces the conclusion to its own inputs by construction, no self-citations are load-bearing for the central result, and the derivation is self-contained as a standard extension of stochastic approximation arguments to the Riemannian setting with time-varying measures.
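The "standard extension of stochastic approximation arguments" presumably runs through a Robbins–Siegmund-type recursion; the shape below is a hedged reconstruction under L-smoothness along geodesics, with step sizes α_k and variance proxies σ_k² matching the summability conditions mentioned above.

```latex
% Presumed one-step inequality, conditional on the filtration F_k:
\[
  \mathbb{E}\bigl[f(x_{k+1}) \mid \mathcal{F}_k\bigr]
  \le f(x_k)
  - \alpha_k \Bigl(1 - \tfrac{L \alpha_k}{2}\Bigr)
    \|\operatorname{grad} f(x_k)\|_{x_k}^{2}
  + \tfrac{L \alpha_k^{2}}{2}\, \sigma_k^{2}.
\]
% Under the usual conditions
\[
  \sum_k \alpha_k = \infty,
  \qquad
  \sum_k \alpha_k^{2}\, \sigma_k^{2} < \infty,
\]
% the Robbins–Siegmund lemma gives a.s. convergence of f(x_k) and
% \liminf_k \|grad f(x_k)\| = 0.
```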
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The objective function is sufficiently smooth and the Riemannian manifold satisfies standard completeness and curvature conditions.
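A typical instantiation of this ledger entry in the Riemannian SGD literature, stated here as our reading rather than the paper's exact hypotheses:

```latex
% (M, g) complete with sectional curvature bounded below; f continuously
% differentiable with geodesically L-Lipschitz gradient:
\[
  \operatorname{sec}_M \ge \kappa > -\infty,
  \qquad
  \bigl\|\operatorname{grad} f(y) - \Gamma_x^y \operatorname{grad} f(x)\bigr\|_y
  \le L\, d(x, y),
\]
% where \Gamma_x^y is parallel transport along a minimizing geodesic
% from x to y and d is the Riemannian distance.
```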
Reference graph
Works this paper leans on
- [1]
- [2] M. Adachi, S. Hayakawa, M. Jørgensen, X. Wan, V. Nguyen, H. Oberhauser, M. A. Osborne, Adaptive Batch Sizes for Active Learning: A Probabilistic Numerics Approach, Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024, Valencia, Spain. PMLR: Volume 238.
- [3] Ya. I. Alber, A. N. Iusem, M. V. Solodov, On the projected subgradient method for nonsmooth convex optimization in a Hilbert space, Mathematical Programming, 81(1):23–35, 1998.
- [4] X. An, L. Shen, Y. Luo, H. Hu, D. Tao, Adaptive Batch Size Time Evolving Stochastic Gradient Descent for Federated Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 2, pp. 1158–1170, Feb. 2026, doi: 10.1109/TPAMI.2025.3610169.
- [5] D. Bertsekas, J. Tsitsiklis, Gradient convergence in gradient methods, https://dspace.mit.edu/handle/1721.1/3462.
- [6] S. Bonnabel, Stochastic Gradient Descent on Riemannian Manifolds, IEEE Transactions on Automatic Control, Volume 58, Issue 9, September 2013.
- [7] L. Bottou, Online Learning and Stochastic Approximations, in Online Learning and Neural Networks, Cambridge University Press, Cambridge, UK, 1998, https://leon.bottou.org/papers/bottou-98x.
- [8] L. Bottou, F. E. Curtis, J. Nocedal, Optimization Methods for Large-Scale Machine Learning, SIAM Review, Vol. 60, Iss. 2, 2018, doi: 10.1137/16M1080173.
- [9] S. De, A. Yadav, D. Jacobs, T. Goldstein, Automated Inference with Adaptive Batches, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP Volume 54.
- [10] A. Devarakonda, M. Naumov, M. Garland, AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks, arXiv:1712.02029.
- [11]
- [12] X. Li, F. Orabona, On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89.
- [13]
- [14]
- [15] J. Liu, L. Xu, Accelerating Stochastic Gradient Descent Using Antithetic Sampling, arXiv:1810.03124.
- [16] J. Mairal, Stochastic majorization-minimization algorithms for large-scale optimization, Advances in Neural Information Processing Systems, pages 2283–2291, 2013.
- [17] P. Ostroukhov, A. Zhumabayeva, C. Xiang, A. Gasnikov, M. Takac, D. Kamzolov, AdaBatchGrad: Combining Adaptive Batch Size and Adaptive Step Size, IMA Journal of Numerical Analysis, draf081, https://doi.org/10.1093/imanum/draf081.
- [18]
- [19]
- [20]
- [21] S. Sievert, S. Shah, Improving the convergence of SGD through adaptive batch sizes, arXiv:1910.08222.
- [22]
- [23]
- [24]
- [25] C. Xu, H. Wu, Mini-Batch Stochastic Gradient Descents on Manifolds, in preparation.
- [26] Peiqi Yang, private communication.
- [27]
- [28] C. Zhang, H. Kjellstrom, S. Mandt, Determinantal Point Processes for Mini-Batch Diversification, arXiv:1705.00607, 2017.
- [29] P. Zhao, T. Zhang, Accelerating Minibatch Stochastic Gradient Descent using Stratified Sampling, arXiv:1405.3080.