pith. machine review for the scientific record.

arxiv: 2604.10860 · v1 · submitted 2026-04-12 · 🧮 math.OC

Recognition: unknown

Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 14:58 UTC · model grok-4.3

classification 🧮 math.OC
keywords stochastic gradient descent · infinite-dimensional Hilbert space · stochastic differential equation · cylindrical Brownian motion · diffusion approximation · weak convergence · inverse problems · optimization

The pith

Stochastic gradient descent in infinite-dimensional Hilbert spaces is weakly approximated, to second order in the step size, by an SDE driven by cylindrical Brownian motion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the discrete iterations of stochastic gradient descent over a Hilbert space can be approximated in the small-step-size regime by solutions of a stochastic differential equation driven by cylindrical Brownian motion. This limit is proved in a weak sense by comparing the expectation of smooth test functionals applied to the discrete and continuous trajectories, yielding a discrepancy of order two in the step size. The result matters because inverse problems in scientific computing frequently require optimization over function spaces, where direct analysis of the discrete algorithm is intractable while the continuous limit admits tools from stochastic analysis. Extending the Euclidean case requires proving well-posedness of the SDE under structural conditions on the noise covariance and replacing strong convergence arguments with weak ones because the sampling noise is discrete rather than Gaussian.
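
For orientation, the two objects being compared have the generic shape below; the notation is this digest's own, and the paper's limiting equation may carry an additional O(η) drift correction, as second-order modified equations do in the Euclidean literature:

\[
x_{k+1} = x_k - \eta\,\nabla F_{\gamma_k}(x_k),
\qquad
dX_t = -\nabla F(X_t)\,dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t,
\]

where W is a cylindrical Brownian motion on the Hilbert space H, γ_k indexes the randomly sampled sub-objective, and Σ is the covariance of the sampling noise ∇F_γ − ∇F.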

Core claim

The discrete dynamics of SGD in infinite-dimensional Hilbert spaces can be approximated by an SDE driven by cylindrical Brownian motion. The analysis extends diffusion-approximation results from Euclidean spaces by addressing two difficulties: establishing well-posedness of the stochastic evolution equation through structural conditions on the covariance operator, and performing the comparison in a weak sense via a suitable class of smooth functionals on the Hilbert space. The discrepancy between SGD and the limiting SDE, when evaluated through these functionals, is of second order in the step size.

What carries the argument

The limiting stochastic evolution equation driven by cylindrical Brownian motion, together with the class of smooth functionals used to measure weak discrepancy of order two in the step size.
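
Concretely, "weak discrepancy of order two" has the standard shape below; the horizon, constant, and admissible class of functionals are generic stand-ins from the modified-equation literature, not quoted from the paper:

\[
\max_{0 \le k \le T/\eta}\;\bigl|\,\mathbb{E}\,g(x_k) - \mathbb{E}\,g(X_{k\eta})\,\bigr| \;\le\; C(T,g)\,\eta^{2}
\]

for every admissible test functional g : H → ℝ, e.g., bounded with bounded, continuous Fréchet derivatives up to the order the proof requires.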

If this is right

  • The continuous-time limit permits the use of stochastic analysis tools to study SGD behavior in infinite-dimensional optimization problems.
  • Numerical experiments confirm the predicted second-order weak convergence behavior between the discrete and continuous dynamics.
  • The framework directly extends previous diffusion approximations that were restricted to Euclidean spaces.
  • The result applies to inverse problems whose unknowns lie in Hilbert spaces, such as those arising in scientific computing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The second-order weak approximation could be used to derive explicit convergence rates or stability estimates for SGD by analyzing the associated SDE instead of the discrete recursion (a minimal instance is sketched after this list).
  • Similar diffusion limits may exist for other stochastic optimization schemes when the parameter space is infinite-dimensional.
  • The technique of testing against smooth functionals could be adapted to obtain quantitative error bounds in related settings such as stochastic approximation on manifolds or Banach spaces.
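
A minimal instance of the first bullet, under assumptions of this digest (F strongly convex with parameter μ, minimizer x*, and constant trace-class Σ): applying Itô's formula to ‖X_t − x*‖² and Grönwall's inequality to the limiting SDE gives

\[
\mathbb{E}\,\|X_t - x^{*}\|^{2} \;\le\; e^{-2\mu t}\,\|x_0 - x^{*}\|^{2} \;+\; \frac{\eta\,\operatorname{Tr}\Sigma}{2\mu},
\]

a stability estimate that transfers to SGD up to the weak approximation error, whereas the discrete recursion yields it only with more bookkeeping.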

Load-bearing premise

The covariance operator must satisfy appropriate structural conditions so that the stochastic evolution equation driven by cylindrical Brownian motion is well-posed.
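
The digest does not quote the paper's exact conditions. The classical sufficient condition in the Da Prato-Zabczyk framework, stated here as an assumption, is that the covariance operator Q be trace class,

\[
Q e_j = \lambda_j e_j, \qquad \operatorname{Tr} Q = \sum_{j \ge 1} \lambda_j < \infty,
\]

in which case Q^{1/2}W_t defines a genuine H-valued Q-Wiener process even though the cylindrical process W itself does not take values in H.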

What would settle it

A numerical experiment in which the weak error between SGD trajectories and the SDE solution fails to decrease quadratically with the step size, or a counterexample in which the SDE ceases to be well-posed once the covariance conditions are removed, would falsify the central approximation result.
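
A toy version of that experiment, in the spirit of Figures 1 and 3: the model below is a hypothetical finite-dimensional stand-in (diagonal quadratic objective, additive Gaussian gradient noise) chosen so that both SGD and the SME have closed-form second moments, removing Monte Carlo error entirely. The drift-corrected SME follows the Euclidean Li-Tai-E form, which may differ in detail from the paper's construction.

import numpy as np

# Hypothetical stand-in: F(x) = 0.5 * sum(a_j x_j^2) on a D-dimensional
# truncation of H, with additive gradient noise xi_k ~ N(0, diag(s2)).
D = 10
a = np.linspace(1.0, 2.0, D)        # eigenvalues of the (diagonal) Hessian
s2 = 1.0 / np.arange(1, D + 1)**2   # noise covariance eigenvalues, summable
x0 = np.ones(D)
T = 1.0

def sgd_second_moment(eta):
    """E||x_K||^2 for x_{k+1} = (1 - eta*a)*x_k - eta*xi_k, K = T/eta."""
    K = int(round(T / eta))
    r = 1.0 - eta * a
    mean2 = r**(2 * K) * x0**2
    var = eta**2 * s2 * (1 - r**(2 * K)) / (1 - r**2)
    return np.sum(mean2 + var)

def sme_second_moment(eta, corrected):
    """E||X_T||^2 for the Ornstein-Uhlenbeck SME dX = -b*X dt + sqrt(eta*s2) dW.
    corrected=True uses the O(eta) drift correction b = a + eta*a^2/2."""
    b = a + 0.5 * eta * a**2 if corrected else a
    mean2 = np.exp(-2 * b * T) * x0**2
    var = eta * s2 * (1 - np.exp(-2 * b * T)) / (2 * b)
    return np.sum(mean2 + var)

for eta in [0.1, 0.05, 0.025, 0.0125]:
    sgd = sgd_second_moment(eta)
    e1 = abs(sgd - sme_second_moment(eta, corrected=False))
    e2 = abs(sgd - sme_second_moment(eta, corrected=True))
    print(f"eta={eta:7.4f}  first-order SME err={e1:.3e}  "
          f"second-order SME err={e2:.3e}")

Halving η should roughly halve the first error column and quarter the second; a persistent failure of the quadratic decay in the corrected column would be exactly the falsifier described above.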

Figures

Figures reproduced from arXiv: 2604.10860 by Anjali Nair, Jaeyoung Yoon, Qin Li, Sandra Cerrai.

Figure 1. (Example 1, MC SGD vs. SME) Weak error between the SGD iterate and the exact SME expectation, estimated using N = 5×10⁶ Monte Carlo samples. Results are shown for different truncation dimensions D of the Hilbert space. The dotted line indicates the reference slope O(η). To validate the theory and visualize the convergence rate in η, we ensure that the impact of the Monte Carlo sampling error is negligible. …
Figure 2. (Example 1, MC discretized SME vs. SME) Convergence of the Monte Carlo estimator with respect to the number of trials N for different step sizes η, averaged over 50 independent runs. For small N, the error decreases at the Monte Carlo rate O(N^{-1/2}) until it saturates and plateaus once the discretization error dominates at large N. The dashed line indicates the reference slope O(N^{-1/2}). …
Figure 3. (Example 1, weak error in the homogeneous noise setting) The plots show the weak error (30) as a function of the step size η for truncation dimensions D ∈ {5, 10, 20, 30}. Error bars in the left panel represent the Monte Carlo standard error (MCSE) computed from 5×10⁶ samples. The dashed line indicates the reference slope O(η²).
Figure 4. (Example 2: target function) Heatmap of the target function ϕ̄ used in the inverse problem, visualized using a truncation level K = 20. We initialize at ϕ₀ = 0 and use the test function g(ϕ) = ‖ϕ‖⁴ to evaluate the weak error. To demonstrate that the convergence is uniform with respect to discretization, we truncate H using the first K × K modes, for K ∈ {2, 3, 4, 6, 8}. To illustrate the effect of the tru…
Figure 5. (Example 2: gradient of the objective functional) Heatmap of ∇F(ϕ) evaluated at ϕ = 0, reconstructed using different truncation levels K. Panels: K = 2, K = 4, K = 8.
Figure 6. (Example 2: stochastic gradient) Heatmap of the stochastic gradient ∇F_x(ϕ) evaluated at ϕ = 0 for a sampled location x (orange dot). The plots show the reconstruction using different truncation levels K.
Figure 7. (Example 2: weak error in the inhomogeneous noise setting) The plot shows the difference between the SGD iterate and the discretized SME as a function of the step size η for different truncation levels K. The dashed line indicates the reference slope O(η²). All results are obtained using 10⁶ Monte Carlo trials with smoothing parameter ε = 0.1.
Figure 8. From left to right: the standard cameraman test image on a 512… Panel titles: Original Image, Resampled Image, Projected Target.
Figure 9. Weak error between SGD and the discretized SME for the cameraman target. The …
Figure 10. A single sample-path illustration of the reconstruction dynamics for the cameraman …
original abstract

Inverse problems in scientific computing often require optimization over infinite-dimensional Hilbert spaces. A commonly used solver in such settings is stochastic gradient descent (SGD), where gradients are approximated using randomly sampled sub-objective functions. In this work we study the continuous-time limit of SGD in the small step-size regime. We show that the discrete dynamics can be approximated by a stochastic differential equation (SDE) driven by cylindrical Brownian motion. The analysis extends diffusion-approximation results previously established in Euclidean spaces to the infinite-dimensional setting. Two analytical difficulties arise in this extension. First, the cylindrical nature of the noise requires establishing well-posedness of the resulting stochastic evolution equation through appropriate structural conditions on the covariance operator. Second, since the randomness in SGD originates from discrete sampling while the limiting equation is driven by Gaussian noise, the comparison between the two dynamics must be carried out in a weak sense. We therefore introduce a suitable class of smooth functionals on the Hilbert space and prove that the discrepancy between SGD and the limiting SDE, when evaluated through these functionals, is of second order in the step size. Numerical experiments confirm the predicted convergence behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper extends diffusion-approximation results for stochastic gradient descent (SGD) from Euclidean spaces to infinite-dimensional Hilbert spaces. It claims that the discrete SGD iterates can be approximated by a stochastic evolution equation driven by cylindrical Brownian motion, with the discrepancy between the discrete dynamics and the limiting SDE being of second order in the step size when evaluated on expectations of a suitable class of smooth test functionals. The analysis requires establishing well-posedness of the limiting equation under structural conditions on the covariance operator and proceeds via weak-convergence arguments to handle the mismatch between discrete sampling noise and Gaussian driving noise.

Significance. If the central derivations hold, the result supplies a rigorous continuous-time limit for SGD in function spaces, which is directly relevant to optimization arising in inverse problems and PDE-constrained settings. The work correctly identifies the two technical obstacles (cylindrical noise and weak-sense comparison) and supplies the necessary technical machinery; the second-order weak approximation on smooth functionals is a concrete strengthening of first-order limits that appear in the Euclidean literature.

major comments (2)
  1. [Abstract and well-posedness section] The structural conditions imposed on the covariance operator to guarantee existence of the stochastic convolution with cylindrical Brownian motion are stated but never verified for the noise that actually arises from random sampling of sub-objectives. In typical inverse-problem settings the covariance is only bounded or compact; without an explicit check that it maps into the required Hilbert-Schmidt or trace-class space after regularization, the limiting SDE is not known to be well-posed and the subsequent weak-convergence argument cannot be invoked.
  2. [Main approximation theorem] The second-order claim for the weak error: the manuscript asserts that the discrepancy, when tested against smooth functionals, is O(η²), where η is the step size. Because the full proof is not reproduced in the supplied abstract, it is impossible to confirm that the Itô-Taylor expansion or generator comparison used to obtain the second-order term remains valid once the cylindrical noise and the infinite-dimensional geometry are taken into account; an explicit statement of the precise regularity assumed on the test functionals and on the drift/diffusion coefficients is needed.
minor comments (2)
  1. [Notation and assumptions] The precise definition of the class of admissible test functionals (e.g., the required Fréchet differentiability order and growth conditions) should be stated once in a dedicated notation subsection rather than scattered across lemmas.
  2. [Numerical section] Numerical experiments are mentioned but the discretization of the Hilbert space and the approximation of the cylindrical noise are not described; adding a short paragraph on the finite-dimensional truncation used would improve reproducibility.
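
For reference, the kind of finite-dimensional noise truncation minor comment 2 asks the authors to document typically looks like the sketch below; the eigenbasis representation and the eigenvalue decay are illustrative assumptions, not the paper's construction.

import numpy as np

def q_wiener_increment(lam, dt, rng):
    """Coefficients of a truncated Q-Wiener increment
    dW ≈ sum_{j<=D} sqrt(lam_j) * dbeta_j * e_j
    in an (assumed) eigenbasis {e_j} of the covariance operator Q."""
    return np.sqrt(lam * dt) * rng.standard_normal(lam.shape)

rng = np.random.default_rng(0)
lam = 1.0 / np.arange(1, 31)**2   # trace-class decay, an illustrative choice
dW = q_wiener_increment(lam, dt=1e-3, rng=rng)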

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive major comments. These have highlighted areas where additional clarification will improve the manuscript. We respond to each point below and indicate the planned revisions.

point-by-point responses
  1. Referee: [Abstract and well-posedness section] The structural conditions imposed on the covariance operator to guarantee existence of the stochastic convolution with cylindrical Brownian motion are stated but never verified for the noise that actually arises from random sampling of sub-objectives. In typical inverse-problem settings the covariance is only bounded or compact; without an explicit check that it maps into the required Hilbert-Schmidt or trace-class space after regularization, the limiting SDE is not known to be well-posed and the subsequent weak-convergence argument cannot be invoked.

    Authors: We agree that the structural conditions on the covariance operator (ensuring the stochastic convolution exists as a mild solution) are stated as assumptions without an explicit verification tied to the random-sampling noise. In the manuscript these conditions are formulated in a general form that applies once the second-moment operator of the gradient noise satisfies the required Hilbert-Schmidt or trace-class property. For typical inverse-problem settings the covariance is indeed compact, but the sampling of sub-objectives often incorporates smoothing from the forward operator or regularization, which upgrades the covariance to the necessary class. To make this transparent we will add a dedicated remark in the well-posedness section that verifies the conditions under standard assumptions on the loss (Lipschitz gradients) and uniform random sampling; an illustrative example with a compact forward operator will be included. This revision directly addresses the applicability concern while preserving the generality of the framework. revision: yes
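
One way the promised remark could go, sketched under assumptions this digest cannot verify: if the gradient-noise covariance has the factored form Q = A*CA, with A a Hilbert-Schmidt forward (or regularized) operator and C a bounded residual covariance, then

\[
\operatorname{Tr}\bigl(A^{*} C A\bigr) \;=\; \sum_{j} \langle C A e_j,\, A e_j \rangle \;\le\; \|C\|\,\|A\|_{HS}^{2} \;<\; \infty,
\]

so Q is trace class, which is exactly the kind of upgrade the rebuttal invokes.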

  2. Referee: [Main approximation theorem] The second-order claim for the weak error: the manuscript asserts that the discrepancy, when tested against smooth functionals, is O(η²), where η is the step size. Because the full proof is not reproduced in the supplied abstract, it is impossible to confirm that the Itô-Taylor expansion or generator comparison used to obtain the second-order term remains valid once the cylindrical noise and the infinite-dimensional geometry are taken into account; an explicit statement of the precise regularity assumed on the test functionals and on the drift/diffusion coefficients is needed.

    Authors: The complete proof of the second-order weak error appears in Sections 3–4 of the full manuscript and proceeds via an Itô-Taylor expansion of the continuous process combined with a generator comparison between the discrete SGD increments and the limiting evolution. The cylindrical noise is handled by working in the reproducing-kernel Hilbert space induced by the covariance operator, which is assumed Hilbert-Schmidt; this replaces the finite-dimensional Itô calculus with the corresponding infinite-dimensional version while preserving the cancellation of first-order terms due to zero-mean noise. To make the argument immediately verifiable we will insert, immediately before the statement of the main theorem, an explicit list of the regularity hypotheses: the test functional is twice Fréchet differentiable with bounded and continuous first and second derivatives, the drift satisfies a global Lipschitz condition, and the diffusion coefficient (square root of the covariance) is Lipschitz with linear growth. These assumptions are already used throughout the proofs but were not collected in one place; adding the list will resolve the concern without altering the result. revision: partial
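
For readers reconstructing the argument, the standard skeleton of such proofs (stated generically here, not quoted from the manuscript) is a one-step estimate upgraded to the whole horizon by stability of the SDE semigroup:

\[
\bigl|\,\mathbb{E}[g(x_1) \mid x_0 = x] - \mathbb{E}[g(X_\eta) \mid X_0 = x]\,\bigr| \le C(x)\,\eta^{3}
\quad\Longrightarrow\quad
\max_{k \le T/\eta} \bigl|\,\mathbb{E}\,g(x_k) - \mathbb{E}\,g(X_{k\eta})\,\bigr| \le C_T\,\eta^{2},
\]

with the Itô-Taylor expansion supplying the one-step bound once the listed regularity hypotheses hold and the O(η) and O(η²) terms of the discrete and continuous one-step generators are matched.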

Circularity Check

0 steps flagged

No circularity: derivation applies standard techniques to new setting

full rationale

The paper extends existing diffusion-approximation results for SGD from Euclidean spaces to infinite-dimensional Hilbert spaces by establishing well-posedness of an SDE driven by cylindrical Brownian motion under structural conditions on the covariance operator, then proving second-order weak convergence on smooth functionals. No equations, definitions, or claims in the provided text reduce the target approximation result to a fitted parameter, a self-referential definition, or a load-bearing self-citation chain; the argument rests on classical stochastic-analysis tools applied to the infinite-dimensional case without tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard background from stochastic evolution equations plus one domain-specific assumption required for the cylindrical noise.

axioms (1)
  • domain assumption: Structural conditions on the covariance operator are imposed to guarantee well-posedness of the limiting SDE.
    Explicitly identified in the abstract as necessary for handling the cylindrical Brownian motion in infinite dimensions.

pith-pipeline@v0.9.0 · 5500 in / 1221 out tokens · 55920 ms · 2026-05-10T14:58:03.910395+00:00 · methodology

