pith. machine review for the scientific record.

arxiv: 2604.02969 · v1 · submitted 2026-04-03 · 📊 stat.ML · cs.LG · stat.CO · stat.ME

Recognition: 2 theorem links · Lean Theorem

Inversion-Free Natural Gradient Descent on Riemannian Manifolds

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 18:12 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · stat.CO · stat.ME
keywords natural gradient · Riemannian manifold · stochastic optimization · Fisher information matrix · online approximation · convergence rate · variational Bayes · normalizing flows
0 comments

The pith

An inversion-free natural gradient method on Riemannian manifolds updates an online approximation to the inverse Fisher information matrix using transported score vectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a stochastic natural gradient descent algorithm that works directly on Riemannian manifolds without needing to invert the Fisher information matrix at each step. It maintains a low-cost online estimate of the inverse FIM by incorporating score vectors from current and past iterates after transporting them to a common tangent space. This yields almost-sure convergence of the squared distance to the optimum at rate O(log s / s^α) provided the step-size schedule has exponent α exceeding 2/3, while the approximation error in the FIM accumulates from the transport steps. A limited-memory version is also given to control storage, and the approach is tested on variational Bayes problems with Gaussian and flow-based approximations where manifold geometry enforces constraints naturally.
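
To make the moving parts concrete, here is a minimal illustrative sketch of such a loop in Python. The helper names (riem_grad, score, retract, transport), the constants, and the Sherman-Morrison-style rank-one correction are assumptions made for illustration only; the paper's exact recursion, damping, and transport operator are not reproduced here.

    import numpy as np

    def inversion_free_ngd(theta0, riem_grad, score, retract, transport,
                           n_iters=1000, alpha=0.75, tau0=0.1, damping=1e-3):
        """Illustrative inversion-free natural-gradient loop on a manifold.

        riem_grad : theta -> Riemannian gradient of the objective at theta
        score     : theta -> one sampled score vector in the tangent space at theta
        retract   : (theta, v) -> point reached by retracting tangent vector v at theta
        transport : (theta_from, theta_to, v) -> v moved into the tangent space at theta_to
        """
        theta = np.asarray(theta0, dtype=float)
        d = theta.size
        H_inv = np.eye(d) / damping          # running inverse-FIM estimate
        for s in range(1, n_iters + 1):
            tau = tau0 / s**alpha            # step-size schedule with exponent alpha > 2/3
            g = riem_grad(theta)
            theta_new = retract(theta, -tau * (H_inv @ g))   # no matrix inversion anywhere
            # transport the freshly sampled score to the new tangent space and fold it
            # into the inverse estimate via a rank-one (Sherman-Morrison-style) correction
            u = transport(theta, theta_new, score(theta))
            Hu = H_inv @ u
            H_inv = H_inv - np.outer(Hu, Hu) / (1.0 + u @ Hu)
            # (transporting the columns of H_inv between tangent spaces is omitted here)
            theta = theta_new
        return theta

On a flat space with transport(a, b, v) = v and retract(theta, v) = theta + v, this collapses to an online Euclidean natural-gradient method, which is a useful sanity check for any implementation.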

Core claim

The central discovery is an inversion-free update for the approximate inverse FIM on the manifold that combines transported score vectors at quadratic cost per iteration and delivers an almost-sure convergence rate of O(log s / s^α) for the squared distance to the minimizer when α > 2/3, together with corresponding rates for the FIM approximation itself.

What carries the argument

The transport-based quadratic update rule for the running inverse-FIM approximation, which moves score vectors between successive tangent spaces to avoid explicit inversion.
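
The algebraic fact that makes a quadratic-cost, inversion-free refresh possible is the standard rank-one inverse (Sherman-Morrison) identity; whether the paper's update takes exactly this form is not stated above, but with u a score vector already transported into the common tangent space it is the natural template:

    $(F + uu^{\top})^{-1} = F^{-1} - \dfrac{F^{-1}u\,u^{\top}F^{-1}}{1 + u^{\top}F^{-1}u}$

If H_s ≈ F_s^{-1} is already stored, the corrected inverse on the right-hand side costs O(d^2) to form, so no linear system is solved at any iteration.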

If this is right

  • The squared distance to the minimizer converges almost surely at O(log s / s^α) for step-size exponents α > 2/3.
  • The approximate FIM converges at almost-sure rates despite the added transport errors.
  • The limited-memory variant achieves the same rates with sub-quadratic storage.
  • The method applies directly to constrained problems such as positive-definite covariance estimation and orthogonal parameters in normalizing flows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar transport-based updates could be applied to other online matrix approximations on manifolds beyond the FIM.
  • The method may allow natural extension to non-Euclidean parameter spaces in deep learning models that already use Riemannian optimization.
  • Choosing manifolds with closed-form transport maps would minimize the error accumulation shown in the rates.
  • Testing on higher-dimensional manifolds could reveal whether the quadratic cost remains practical at scale.

Load-bearing premise

The transport operations between tangent spaces must be accurate enough that their accumulated errors do not invalidate the convergence rates, and the manifold must be regular enough for the stochastic approximation to converge.

What would settle it

A numerical counter-example on a simple manifold such as the positive-definite cone where large transport approximation error causes the iterates to diverge or fail to achieve the predicted rate.
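
A minimal sketch of how such a check could be scored in practice, assuming one has logged the geodesic distances d(θ(s), θ*) along a run; the function name, the burn-in fraction, and the decision to ignore the log s factor are illustrative choices, not taken from the paper.

    import numpy as np

    def empirical_decay_exponent(distances, burn_in_frac=0.2):
        """Fit a in d(theta_s, theta*)^2 ~ C / s^a from a logged run.

        If the fitted exponent sits clearly below the step-size exponent alpha
        used in the run, the predicted O(log s / s^alpha) rate is not being met.
        Ignoring the log s factor nudges the fitted exponent slightly below alpha.
        """
        d2 = np.asarray(distances, dtype=float) ** 2
        s = np.arange(1, len(d2) + 1)
        keep = s >= max(2, int(burn_in_frac * len(d2)))
        slope, _ = np.polyfit(np.log(s[keep]), np.log(d2[keep]), 1)
        return -slope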

Figures

Figures reproduced from arXiv: 2604.02969 by Dario Draca, Minh-Ngoc Tran, Takuo Matsubara.

Figure 1. NELBO versus iteration for the Bayesian logistic regression model.
Figure 2. Comparison of approximate/exact natural gradient algorithms on larger datasets.
Figure 3. Negative lower bound plots of the three VB methods.
read the original abstract

The natural gradient method is widely used in statistical optimization, but its standard formulation assumes a Euclidean parameter space. This paper proposes an inversion-free stochastic natural gradient method for probability distributions whose parameters lie on a Riemannian manifold. The manifold setting offers several advantages: one can implicitly enforce parameter constraints such as positive definiteness and orthogonality, ensure parameters are identifiable, or guarantee regularity properties of the objective like geodesic convexity. Building on an intrinsic formulation of the Fisher information matrix (FIM) on a manifold, our method maintains an online approximation of the inverse FIM, which is efficiently updated at quadratic cost using score vectors sampled at successive iterates. In the Riemannian setting, these score vectors belong to different tangent spaces and must be combined using transport operations. We prove almost-sure convergence rates of $O(\log{s}/s^\alpha)$ for the squared distance to the minimizer when the step size exponent $\alpha >2/3$. We also establish almost-sure rates for the approximate FIM, which now accumulates transport-based errors. A limited-memory variant of the algorithm with sub-quadratic storage complexity is proposed. Finally, we demonstrate the effectiveness of our method relative to its Euclidean counterparts on variational Bayes with Gaussian approximations and normalizing flows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces an inversion-free stochastic natural gradient descent algorithm for optimization over Riemannian manifolds, particularly for probability distributions. It maintains an online approximation to the inverse Fisher information matrix by updating with score vectors that are parallel-transported between tangent spaces at successive iterates. The central theoretical claims are almost-sure convergence rates of O(log s / s^α) for the squared distance to the minimizer when the step-size exponent α > 2/3, together with almost-sure rates for the approximate FIM, whose recursion explicitly accumulates transport-based errors. A limited-memory variant with sub-quadratic storage is proposed, and the method is evaluated on variational Bayes tasks using Gaussian approximations and normalizing flows.

Significance. If the claimed rates survive the accumulated transport errors, the work would be a useful extension of natural-gradient methods to manifold-constrained settings that arise in statistical machine learning. The inversion-free online update, explicit handling of transport operations, and limited-memory implementation address both computational and geometric constraints, while the almost-sure rates provide stronger guarantees than typical in-expectation analyses.

major comments (1)
  1. [Convergence analysis] Convergence analysis (proof of the almost-sure rate for squared distance): the argument must show that the summed transport discrepancy, which is O(s^{1-α}) when each parallel-transport error is O(‖x_{s+1}-x_s‖), remains dominated by the driving noise and step-size terms in the supermartingale inequality for all α > 2/3. An explicit bound on the per-step transport error (or its expectation) and its insertion into the distance recursion is required; without it the rate claim for α close to 2/3 is not yet verified.
minor comments (2)
  1. [Abstract] The abstract states that the approximate FIM 'accumulates transport-based errors' yet still yields rates; the precise order of these errors and the manifold regularity assumptions (e.g., bounded curvature, geodesic convexity) should be stated explicitly in the abstract or introduction.
  2. [Experiments] Experimental section: the choice of step-size exponent α and the concrete implementation of parallel transport (exact or approximate) should be reported for each experiment so that the claimed robustness to transport errors can be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We appreciate the positive assessment of the method's potential and address the major comment on the convergence analysis below. We will revise the paper to incorporate the requested explicit bounds and verifications.

read point-by-point responses
  1. Referee: [Convergence analysis] Convergence analysis (proof of the almost-sure rate for squared distance): the argument must show that the summed transport discrepancy, which is O(s^{1-α}) when each parallel-transport error is O(‖x_{s+1}-x_s‖), remains dominated by the driving noise and step-size terms in the supermartingale inequality for all α > 2/3. An explicit bound on the per-step transport error (or its expectation) and its insertion into the distance recursion is required; without it the rate claim for α close to 2/3 is not yet verified.

    Authors: We agree that an explicit treatment of the transport discrepancy is necessary to fully verify the almost-sure rate for α close to 2/3. In the revised manuscript we will add a supporting lemma that bounds the per-step parallel-transport error by O(‖x_{s+1}−x_s‖) (with the implicit constant depending only on sectional-curvature bounds and the local Lipschitz constant of the score map, both assumed finite in a neighborhood of the minimizer). This bound will be inserted directly into the supermartingale recursion for the squared geodesic distance. The resulting accumulated transport term is O(s^{1-α}); we will show that, for every α>2/3, this term is dominated by the driving step-size and martingale-noise contributions that already produce the claimed O(log s / s^α) rate. The expanded argument, including the precise comparison of exponents, will appear in the appendix of the revised version. revision: yes
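
For orientation, the supermartingale tool referred to here appears to be of the Robbins-Siegmund type (the theorem is invoked in the paper's appendix excerpts quoted further down); a generic statement, not the paper's exact recursion, reads: if V_s ≥ 0 and a_s, b_s, c_s ≥ 0 are F_s-measurable with

    $\mathbb{E}[V_{s+1}\mid\mathcal{F}_s] \le (1+a_s)\,V_s - b_s + c_s, \qquad \sum_s a_s < \infty, \quad \sum_s c_s < \infty \ \text{a.s.},$

then V_s converges almost surely and Σ_s b_s < ∞ a.s. With V_s the squared geodesic distance to the minimizer, the open question the referee raises is where the per-step transport error, of order ‖x_{s+1}-x_s‖ = O(τ_s) = O(s^{-α}) under bounded update norms, enters this recursion, and whether its contribution stays dominated for every α > 2/3.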

Circularity Check

0 steps flagged

Minor self-citation on background Riemannian tools; central convergence derivation remains independent

full rationale

The paper derives almost-sure rates O(log s / s^α) for α > 2/3 via supermartingale analysis on the squared geodesic distance, explicitly incorporating accumulated transport errors in the approximate FIM recursion. These steps rest on standard assumptions (geodesic convexity, bounded curvature for transport bounds) drawn from the Riemannian optimization literature rather than any self-defined quantity or fitted parameter. No equation reduces a claimed prediction to an input by construction, and any self-citations support only auxiliary geometric facts without bearing the load of the rate proof.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the existence of an intrinsic Fisher information matrix on the manifold, the feasibility of transport maps between tangent spaces at successive iterates, and standard stochastic-approximation assumptions on step sizes and noise.

free parameters (1)
  • step-size exponent α
    Must satisfy α > 2/3 for the stated almost-sure rate; chosen by the user to guarantee convergence.
axioms (2)
  • domain assumption The parameter space is a Riemannian manifold equipped with a well-defined Fisher information metric.
    Invoked to justify the intrinsic formulation and the need for transport operations.
  • ad hoc to paper Score vectors from different iterates can be transported without introducing errors that invalidate the convergence analysis.
    Required for the almost-sure rates to hold in the presence of manifold curvature.

pith-pipeline@v0.9.0 · 5524 in / 1442 out tokens · 37890 ms · 2026-05-13T18:12:12.978807+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

  1. [1]

    Riemannian Adaptive Optimization Methods

Absil, P.-A., Mahony, R. & Sepulchre, R. (2009), Optimization Algorithms on Matrix Manifolds, Princeton University Press. Absil, P.-A. & Malick, J. (2012), 'Projection-like retractions on matrix manifolds', SIAM Journal on Optimization 22(1), 135–158. Akyol, M. & Şahin, B. (2019), 'Conformal semi-invariant R...

  2. [2]

    & Sepulchre, R

Meyer, G., Bonnabel, S. & Sepulchre, R. (2011), Linear regression under fixed-rank constraints: a Riemannian approach, in '28th International Conference on Machine Learning'. Mishra, B., Meyer, G., Bonnabel, S. & Sepulchre, R. (2014), 'Fixed-rank matrix factorizations and Riemannian low-rank optimization', Computational Statistics 29(3), 591–621. Neste...

  3. [3]

    & Hansen, N

Ollivier, Y., Arnold, L., Auger, A. & Hansen, N. (2017), 'Information-geometric optimization algorithms: A unifying picture via invariance principles', Journal of Machine Learning Research 18(18), 1–65. Park, H., Amari, S.-I. & Fukumizu, K. (2000), 'Adaptive natural gradient learning algorithms for various stochastic models', Neural Networks 13(7), 755–764. Pennec, ...

  4. [4]

Tran, M.-N., Nott, D. J. & Kohn, R. (2017), 'Variational Bayes with intractable likelihood', Journal of Computational and Graphical Statistics 26(4), 873–882. Tripuraneni, N., Flammarion, N., Bach, F. & Jordan, M. I. (2018), Averaging stochastic gradient descent on Riemannian manifolds, in 'Conference On Learning Theory', PMLR, pp. 650–687. Villani, C. et al...

  5. [5]

Since Γ^R is isometric, we can evaluate the norm as ‖T_{v_x}[w]‖_{R_x(v)} = ‖G_x(v)[w]‖_x = ‖(G_x(v)[w] − w) + w‖_x. (8.7) The result follows via the ordinary and reversed triangle inequalities

We have the truncated Taylor expansion: G_x(v) = Id_x + (1/2) D^2 G_x(t*v)[v, v, ·], t* ∈ [0,1]. (8.5) Hence ‖T_{v_x}[w] − Γ^R_{v_x}[w]‖_{R_x(v)} = ‖G_x(v)[w] − w‖_x ≤ ‖D^2 G_x(t*v)‖_op ‖v‖_x^2 ‖w‖_x. (8.6) By the smoothness of Γ^R and R, let c < ∞ denote the supremum of ‖D^2 G_x(u)‖_op over the compact set {(x, u) ∈ TX : ‖u‖_x ≤ δ}. Since Γ^R is isometric, we can evaluate the norm as ‖T_{v_x}[w]‖_{R_x(v)} = ‖G_x(v)[w]‖_x = ‖(G_x(v)...

  6. [6]

    zero acceleration

The main result then follows by a Taylor series argument. Differentiating G along v, we have: d/dt G(tv)[w]|_{t=0} = (∇DR_x^{-1})[v, D exp_x(0)[w]] + DR_x^{-1}(x)[∇_v(D exp_x[w])] (8.12) = (∇DR_x^{-1})[v, w] + ∇_v(D exp_x[w]). (8.13) This terminology is standard in Riemannian geometry; see e.g. Do Carmo & Flaherty Francis (1992). Here ∇_v(D exp_x[w]) is the ordinary covariant deriv...

  7. [7]

An immediate consequence of the previous lemma is that T agrees with parallel transport up to second order terms

The Taylor expansion is thus G(v) = Id_x + O(‖v‖_x^2), where ‖v‖_x = ‖exp_x^{-1}(y)‖_x = d(x, y), which concludes the proof. An immediate consequence of the previous lemma is that T agrees with parallel transport up to second order terms. In the following lemmas, we define T_{x,y}[w] := DR_x(R_x^{-1}(y))[w], and let Γ_{x,y} denote the parallel transport along the geodesic connec...

  8. [8]

B.3 Claim: composite transport behaves like an isometry for large k, s

… ≤ [ (1 + O(‖v_s‖_{θ(s−1)}^2)) V_s + 4(C′_0 + C′_1 W_s^2)/(s log^{1+δ} s) ]. (8.38) By the Robbins-Siegmund theorem, V_s = O(1) and thus ‖M_s‖_{θ(s−1)}^2 = O(log^{1+δ} s / s) a.s. B.3 Claim: composite transport behaves like an isometry for large k, s. Let Γ^R_{s,s−1} denote parallel transport along γ(t) = R_{θ(s)}(t R_{θ(s)}^{-1}(θ(s−1))) from t = 0 to 1. Write T_{[s,k]}^{-1} = Γ^{R,−1}_{s,s−1}(Γ^R...

  9. [9]

via a Taylor series that the coefficient of V_s is ≤ 1 eventually, hence ‖M^3_{s+1}‖_{θ*} = O(log^{2+δ} s / s^{2α−1}), a.s.

It follows that E[V_{s+1} | F_s] ≤ (1 − c̃/s^α)(1 + 1/s)^{2α−1} V_s + O(1/(s log^{1+δ} s)). (8.94) One can show e.g. via a Taylor series that the coefficient of V_s is ≤ 1 eventually, hence ‖M^3_{s+1}‖_{θ*} = O(log^{2+δ} s / s^{2α−1}), a.s. (8.95) D.6 Claim: ‖δ_s‖_{θ*} = O(‖∆_s‖_{θ*}^2). Expanding the definition of δ_s: δ_s = Γ^{-1}_{*,s}∇L(θ(s)) − ∇^2 L(θ*) exp^{-1}_{θ*}(θ(s)) = (a) + ∇^2 L(θ*)[exp^{-1}_{θ*}(θ(...

  10. [10]

    sliding window

(8.107) The final term has a finite sum a.s., therefore by the Robbins-Siegmund theorem, Σ_{s=2}^∞ V_s = Σ_{s=2}^∞ (s^{2α−1}/log^{1+δ} s) ‖v_s‖_{θ(s−1)}^2 < ∞, a.s. (8.108) Let c_s := s^γ/(log s)^{1+δ} for some δ > 0 and γ < 2α−1; by Cauchy-Schwarz, Σ_{s=k}^∞ d(θ(s), θ(s+1)) d(θ(s), θ*) ≤ (Σ_{s=k}^∞ c_s d(θ(s), θ(s+1))^2)^{1/2} (Σ_{s=k}^∞ c_s^{-1} d(θ(s), θ*)^2)^{1/2}. (8.109) From Theorem 8.2 we have d(θ(s), θ(s...

  11. [11]

    Takatsu (2011), Malagò et al

There are many references on the latter; see e.g. Takatsu (2011), Malagò et al. (2018), Bhatia et al. (2019). Then, we provide some additional details on our experimental methodology. I.1 The Fisher and Bures-Wasserstein Metrics. The space of non-degenerate Gaussian distributions on R^d can be viewed as a smooth manifold M, parametrized by the mean and cova...

  12. [12]

    (2018), we include it here for completeness

:= (m_2 − m_1, Σ_1^{-1} # Σ_2 − I). (8.171) This is also provided in Malagò et al. (2018); we include it here for completeness. Letting log_{Σ_1}(Σ

  13. [13]

I.3 Algorithm 1 – Representing, Updating, and Transporting H_s^{-1}. Let ℓ(x) denote the log-likelihood of N(µ, Σ). The two components of the score are then ∇_µ ℓ(x) = Σ^{-1}(x − µ), ∇_Σ ℓ(x) = (1/2)Σ^{-1}(x − µ)(x − µ)^⊤ Σ^{-1} − (1/2)Σ^{-1}. (8.173) The approximate Fisher information H_s is decomposed into parts which act separately on the mean and covariance parameters; we denote thes...

  14. [14]

is R_X(U) = (I − (1/2)W_U)^{-1}(I + (1/2)W_U) X. (8.198) The associated isometric vector transport (Zhu 2017, Lemma

  15. [15]

is T_U(V) = (I − (1/2)W_U)^{-1}(I + (1/2)W_U) V, V ∈ T_X St(p, n), (8.199) which maps T_X St(p, n) → T_{R_X(U)} St(p, n). J.2 Algorithm Details. The variational parameter θ = (W_1, W_2, b_1, b_2) belongs to the product manifold M = St(d, d) × St(d, d) × R^d × R^d. (8.200) Closed-form expressions for the Euclidean score ∇^E_θ log q_θ(y) and the negative ELBO gradient ∇^E_θ L(θ) are derived in Appendix A.5 of G...

  16. [16]

Riemannian Natural Gradient: This method follows the limited-memory variant of Algorithm 1 described in Remark 4.3 to move and update H_s^{-1}

Riemannian Stochastic Gradient Descent: This is given by θ(s+1) = R_{θ(s)}(−τ_{s+1} ∇_θ L̂(θ(s))), (8.202) where ∇_θ L̂(θ(s)) = Proj_{θ(s)}[∇^E_θ L̂(θ(s))] is the stochastic Riemannian NELBO gradient. Riemannian Natural Gradient: This method follows the limited-memory variant of Algorithm 1 described in Remark 4.3 to move and update H_s^{-1}. The update is θ(s+1) = R_{θ(s)}(−τ_{s...