Recognition: no theorem link
Convergence of Riemannian Stochastic Gradient Descents: Varying Batch Sizes And Nonstandard Batch Forming
Pith reviewed 2026-05-10 18:35 UTC · model grok-4.3
The pith
Riemannian stochastic gradient descent still converges when the underlying probability space is allowed to vary from iteration to iteration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish convergence theorems for Riemannian stochastic gradient descents in which the underlying probability spaces vary from iteration to iteration. As applications, we deduce convergence results for Riemannian stochastic gradient descents with varying batch sizes and unbiased batch forming schemes.
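For orientation, a hedged rendering of the update such theorems govern, in notation the abstract does not itself fix (the symbols μ_k, α_k, G, and the use of the exponential map are our assumptions):

```latex
% Assumed iteration: draw xi_k from the iteration-dependent measure mu_k,
% then step along the manifold via the exponential map (or a retraction).
\[
  x_{k+1} = \exp_{x_k}\!\bigl(-\alpha_k\, G(x_k, \xi_k)\bigr),
  \qquad \xi_k \sim \mu_k,
\]
\[
  \mathbb{E}_{\mu_k}\!\bigl[\, G(x_k, \xi_k) \mid \mathcal{F}_k \,\bigr]
  = \operatorname{grad} f(x_k) \in T_{x_k}M.
\]
```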
What carries the argument
The iteration-dependent sequence of probability measures, which must deliver unbiased gradient estimates with controlled variance or suitable Lipschitz properties on the manifold.
Load-bearing premise
That the sequence of probability measures provides unbiased gradient estimates with bounded variance or satisfies appropriate Lipschitz conditions at each iteration.
What would settle it
A concrete counterexample on a manifold where unbiased but varying batch sizes cause the iterates to diverge despite satisfying all other stated conditions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript establishes general convergence theorems for Riemannian stochastic gradient descent under a sequence of iteration-dependent probability measures μ_k that are assumed to yield unbiased stochastic gradients and controlled variance in the tangent spaces. It then deduces convergence for the special cases of varying batch sizes and nonstandard unbiased batch-forming schemes by claiming these schemes satisfy the general hypotheses.
Significance. If the general theorems are correctly proved under the stated assumptions on the measures μ_k, the work would supply a useful abstract framework for handling non-stationary sampling in Riemannian optimization. This could facilitate analysis of practical variants such as adaptive batching on manifolds. No machine-checked proofs or parameter-free derivations are present, so the contribution rests entirely on the analytic arguments.
major comments (1)
- [§4] Applications to varying batch sizes and nonstandard batch forming: the deduction that these schemes satisfy the unbiasedness condition E_{μ_k}[G(x_k, ξ_k)] = grad f(x_k) and the second-moment bound in T_{x_k}M is asserted without explicit verification or calculation (the display below spells out the i.i.d. case). Because the Riemannian setting requires all expectations to be taken in the tangent space at the current point (with parallel transport where cross-iteration comparisons are needed), any failure of unbiasedness under non-i.i.d. batch rules would invalidate the supermartingale argument used for convergence. This verification is load-bearing for the claimed applications.
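To make the requested verification concrete, here is the calculation the report asks for in the simplest case, an i.i.d. batch of varying size; the batch size b_k, per-sample gradients grad f_ξ, and per-sample variance σ²(x_k) are our notation, not the manuscript's.

```latex
% Averaging b_k i.i.d. per-sample gradients inside T_{x_k}M; unbiasedness
% follows from linearity of expectation, for any batch-size schedule b_k.
\[
  G(x_k) = \frac{1}{b_k}\sum_{i=1}^{b_k} \operatorname{grad} f_{\xi_{k,i}}(x_k),
  \qquad
  \mathbb{E}\bigl[G(x_k) \mid \mathcal{F}_k\bigr]
  = \frac{1}{b_k}\sum_{i=1}^{b_k}
    \mathbb{E}\bigl[\operatorname{grad} f_{\xi_{k,i}}(x_k) \mid \mathcal{F}_k\bigr]
  = \operatorname{grad} f(x_k),
\]
\[
  \mathbb{E}\bigl[\|G(x_k) - \operatorname{grad} f(x_k)\|_{x_k}^{2} \mid \mathcal{F}_k\bigr]
  = \frac{\sigma^{2}(x_k)}{b_k}
  \quad \text{(using independence within the batch).}
\]
% For non-i.i.d. batch-forming rules, the last equality in the first
% display is exactly what must be re-derived from the definition of mu_k.
```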
minor comments (2)
- The statement of the general assumptions on the sequence {μ_k} (e.g., summability of variance terms) could be collected in a single numbered list for easier reference when checking the applications.
- Notation for the stochastic gradient G and its norm in the tangent space is occasionally inconsistent between the general theorems and the batch-size examples; a uniform symbol or explicit reminder about parallel transport would improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address the single major comment below and describe the revisions we will make.
Point-by-point responses
- Referee: [§4] Applications to varying batch sizes and nonstandard batch forming: the deduction that these schemes satisfy the unbiasedness condition E_{μ_k}[G(x_k, ξ_k)] = grad f(x_k) and the second-moment bound in T_{x_k}M is asserted without explicit verification or calculation. Because the Riemannian setting requires all expectations to be taken in the tangent space at the current point (with parallel transport where cross-iteration comparisons are needed), any failure of unbiasedness under non-i.i.d. batch rules would invalidate the supermartingale argument used for convergence. This verification is load-bearing for the claimed applications.
  Authors: We agree that the manuscript asserts without explicit calculation that the varying-batch-size and nonstandard-batch-forming schemes satisfy the unbiasedness and second-moment hypotheses of the general theorems. In the revised version we will add the missing verifications in §4. For varying batch sizes we will show directly that the batch average G is an unbiased estimator of grad f(x_k) in T_{x_k}M, because each summand is drawn independently from the same data distribution; the second-moment bound then follows from the per-sample variance assumption. For the nonstandard batch-forming schemes we will compute E_{μ_k}[G] explicitly from the definition of μ_k and confirm it equals grad f(x_k), together with the corresponding second-moment control. These calculations will be performed entirely within the tangent space at the current point x_k, so no parallel transport between iterations is required for the per-step conditions. The added details will make the application of the supermartingale argument fully rigorous.
  Revision: yes
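A minimal numerical sanity check of the kind of verification the authors promise, assuming the unit sphere S² with f(x) = −⟨a, x⟩ (so the Riemannian gradient is the tangential projection of −a) and i.i.d. Gaussian per-sample noise; every name and parameter below is illustrative, not taken from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)

def proj_tangent(x, v):
    """Project v onto the tangent space T_x S^2 = {w : <w, x> = 0}."""
    return v - np.dot(v, x) * x

def riem_grad(x, a):
    """Riemannian gradient of f(x) = -<a, x> on the unit sphere."""
    return proj_tangent(x, -a)

def noisy_batch_grad(x, a, batch_size):
    """Batch average of per-sample estimates: the exact gradient plus
    zero-mean Gaussian noise projected into T_x S^2, so each estimate
    stays tangent and unbiased by construction."""
    samples = [riem_grad(x, a) + proj_tangent(x, rng.normal(size=3))
               for _ in range(batch_size)]
    return np.mean(samples, axis=0)

a = np.array([1.0, 2.0, -0.5])
x = np.array([0.0, 0.0, 1.0])   # base point on S^2
exact = riem_grad(x, a)

# Varying batch sizes b_k: the empirical mean of the estimator should
# match grad f(x), and its total variance should shrink like 1/b_k (the
# projected noise has total variance 2, the tangent-space dimension).
for b in [1, 4, 16]:
    est = np.array([noisy_batch_grad(x, a, b) for _ in range(20000)])
    bias = np.linalg.norm(est.mean(axis=0) - exact)
    var = est.var(axis=0).sum()
    print(f"b={b:2d}  bias={bias:.4f}  total var={var:.4f}  (expect ~{2/b:.4f})")
```

Projecting the noise keeps every per-sample estimate in T_x S², matching the authors' point that the per-step conditions live entirely in the tangent space at the current iterate.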
Circularity Check
Direct mathematical convergence proofs without self-referential reductions
full rationale
The paper establishes general convergence theorems for Riemannian SGD where the probability measures vary per iteration, under assumptions that ensure unbiased gradients and controlled variance. The applications to varying batch sizes and nonstandard unbiased batch forming are direct consequences of these theorems, as the schemes are stated to satisfy the required unbiasedness. No step reduces the conclusion to its own inputs by construction, no self-citations are load-bearing for the central result, and the derivation is self-contained as a standard extension of stochastic approximation arguments to the Riemannian setting with time-varying measures.
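The "standard extension of stochastic approximation arguments" presumably runs through a Robbins–Siegmund-type recursion; the shape below is a hedged reconstruction under L-smoothness along geodesics, with step sizes α_k and variance proxies σ_k² matching the summability conditions mentioned above.

```latex
% Presumed one-step inequality, conditional on the filtration F_k:
\[
  \mathbb{E}\bigl[f(x_{k+1}) \mid \mathcal{F}_k\bigr]
  \le f(x_k)
  - \alpha_k \Bigl(1 - \tfrac{L \alpha_k}{2}\Bigr)
    \|\operatorname{grad} f(x_k)\|_{x_k}^{2}
  + \tfrac{L \alpha_k^{2}}{2}\, \sigma_k^{2}.
\]
% Under the usual conditions
\[
  \sum_k \alpha_k = \infty,
  \qquad
  \sum_k \alpha_k^{2}\, \sigma_k^{2} < \infty,
\]
% the Robbins–Siegmund lemma gives a.s. convergence of f(x_k) and
% \liminf_k \|grad f(x_k)\| = 0.
```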
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The objective function is sufficiently smooth and the Riemannian manifold satisfies standard completeness and curvature conditions.
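A typical instantiation of this ledger entry in the Riemannian SGD literature, stated here as our reading rather than the paper's exact hypotheses:

```latex
% (M, g) complete with sectional curvature bounded below; f continuously
% differentiable with geodesically L-Lipschitz gradient:
\[
  \operatorname{sec}_M \ge \kappa > -\infty,
  \qquad
  \bigl\|\operatorname{grad} f(y) - \Gamma_x^y \operatorname{grad} f(x)\bigr\|_y
  \le L\, d(x, y),
\]
% where \Gamma_x^y is parallel transport along a minimizing geodesic
% from x to y and d is the Riemannian distance.
```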
Reference graph
Works this paper leans on
- [1]
- [2] M. Adachi, S. Hayakawa, M. Jørgensen, X. Wan, V. Nguyen, H. Oberhauser, M. A. Osborne, Adaptive Batch Sizes for Active Learning: A Probabilistic Numerics Approach, Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024, Valencia, Spain. PMLR: Volume 238.
- [3] Ya. I. Alber, A. N. Iusem, M. V. Solodov, On the projected subgradient method for nonsmooth convex optimization in a Hilbert space, Mathematical Programming, 81(1):23–35, 1998.
- [4] X. An, L. Shen, Y. Luo, H. Hu, D. Tao, Adaptive Batch Size Time Evolving Stochastic Gradient Descent for Federated Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 2, pp. 1158–1170, Feb. 2026, doi: 10.1109/TPAMI.2025.3610169.
- [5] D. Bertsekas, J. Tsitsiklis, Gradient convergence in gradient methods, https://dspace.mit.edu/handle/1721.1/3462.
- [6] S. Bonnabel, Stochastic Gradient Descent on Riemannian Manifolds, IEEE Transactions on Automatic Control, Volume 58, Issue 9, September 2013.
- [7] L. Bottou, Online Learning and Stochastic Approximations, in Online Learning and Neural Networks, Cambridge University Press, Cambridge, UK, 1998, https://leon.bottou.org/papers/bottou-98x.
- [8] L. Bottou, F. E. Curtis, J. Nocedal, Optimization Methods for Large-Scale Machine Learning, SIAM Review, Vol. 60, Iss. 2, 2018, doi: 10.1137/16M1080173.
- [9] S. De, A. Yadav, D. Jacobs, T. Goldstein, Automated Inference with Adaptive Batches, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP Volume 54.
- [10] A. Devarakonda, M. Naumov, M. Garland, AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks, arXiv:1712.02029.
- [11]
- [12] X. Li, F. Orabona, On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89.
- [13]
- [14]
- [15] J. Liu, L. Xu, Accelerating Stochastic Gradient Descent Using Antithetic Sampling, arXiv:1810.03124.
- [16] J. Mairal, Stochastic majorization-minimization algorithms for large-scale optimization, Advances in Neural Information Processing Systems, pages 2283–2291, 2013.
- [17] P. Ostroukhov, A. Zhumabayeva, C. Xiang, A. Gasnikov, M. Takac, D. Kamzolov, AdaBatchGrad: Combining Adaptive Batch Size and Adaptive Step Size, IMA Journal of Numerical Analysis, draf081, https://doi.org/10.1093/imanum/draf081.
- [18]
- [19]
- [20]
- [21] S. Sievert, S. Shah, Improving the convergence of SGD through adaptive batch sizes, arXiv:1910.08222.
- [22]
- [23]
- [24]
- [25] C. Xu, H. Wu, Mini-Batch Stochastic Gradient Descents on Manifolds, in preparation.
- [26] Peiqi Yang, private communication.
- [27]
- [28] C. Zhang, H. Kjellstrom, S. Mandt, Determinantal Point Processes for Mini-Batch Diversification, arXiv:1705.00607, 2017.
- [29] P. Zhao, T. Zhang, Accelerating Minibatch Stochastic Gradient Descent using Stratified Sampling, arXiv:1405.3080.