pith. machine review for the scientific record.

arxiv: 2604.02969 · v1 · submitted 2026-04-03 · 📊 stat.ML · cs.LG · stat.CO · stat.ME

Recognition: 2 theorem links · Lean Theorem

Inversion-Free Natural Gradient Descent on Riemannian Manifolds

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 18:12 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · stat.CO · stat.ME
keywords natural gradient · Riemannian manifold · stochastic optimization · Fisher information matrix · online approximation · convergence rate · variational Bayes · normalizing flows
0 comments

The pith

An inversion-free natural gradient method on Riemannian manifolds updates an online approximation to the inverse Fisher information matrix using transported score vectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a stochastic natural gradient descent algorithm that works directly on Riemannian manifolds without needing to invert the Fisher information matrix at each step. It maintains a low-cost online estimate of the inverse FIM by incorporating score vectors from current and past iterates after transporting them to a common tangent space. This yields almost-sure convergence of the squared distance to the optimum at rate O(log s / s^α) provided the step-size schedule has exponent α exceeding 2/3, while the approximation error in the FIM accumulates from the transport steps. A limited-memory version is also given to control storage, and the approach is tested on variational Bayes problems with Gaussian and flow-based approximations where manifold geometry enforces constraints naturally.
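
To make the moving parts concrete, here is a minimal illustrative sketch of such a loop in Python. The helper names (riem_grad, score, retract, transport), the constants, and the Sherman-Morrison-style rank-one correction are assumptions made for illustration only; the paper's exact recursion, damping, and transport operator are not reproduced here.

    import numpy as np

    def inversion_free_ngd(theta0, riem_grad, score, retract, transport,
                           n_iters=1000, alpha=0.75, tau0=0.1, damping=1e-3):
        """Illustrative inversion-free natural-gradient loop on a manifold.

        riem_grad : theta -> Riemannian gradient of the objective at theta
        score     : theta -> one sampled score vector in the tangent space at theta
        retract   : (theta, v) -> point reached by retracting tangent vector v at theta
        transport : (theta_from, theta_to, v) -> v moved into the tangent space at theta_to
        """
        theta = np.asarray(theta0, dtype=float)
        d = theta.size
        H_inv = np.eye(d) / damping          # running inverse-FIM estimate
        for s in range(1, n_iters + 1):
            tau = tau0 / s**alpha            # step-size schedule with exponent alpha > 2/3
            g = riem_grad(theta)
            theta_new = retract(theta, -tau * (H_inv @ g))   # no matrix inversion anywhere
            # transport the freshly sampled score to the new tangent space and fold it
            # into the inverse estimate via a rank-one (Sherman-Morrison-style) correction
            u = transport(theta, theta_new, score(theta))
            Hu = H_inv @ u
            H_inv = H_inv - np.outer(Hu, Hu) / (1.0 + u @ Hu)
            # (transporting the columns of H_inv between tangent spaces is omitted here)
            theta = theta_new
        return theta

On a flat space with transport(a, b, v) = v and retract(theta, v) = theta + v, this collapses to an online Euclidean natural-gradient method, which is a useful sanity check for any implementation.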

Core claim

The central discovery is an inversion-free update for the approximate inverse FIM on the manifold that combines transported score vectors at quadratic cost per iteration and delivers an almost-sure convergence rate of O(log s / s^α) for the squared distance to the minimizer when α > 2/3, together with corresponding rates for the FIM approximation itself.

What carries the argument

The transport-based quadratic update rule for the running inverse-FIM approximation, which moves score vectors between successive tangent spaces to avoid explicit inversion.
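
The algebraic fact that makes a quadratic-cost, inversion-free refresh possible is the standard rank-one inverse (Sherman-Morrison) identity; whether the paper's update takes exactly this form is not stated above, but with u a score vector already transported into the common tangent space it is the natural template:

    $(F + uu^{\top})^{-1} = F^{-1} - \dfrac{F^{-1}u\,u^{\top}F^{-1}}{1 + u^{\top}F^{-1}u}$

If H_s ≈ F_s^{-1} is already stored, the corrected inverse on the right-hand side costs O(d^2) to form, so no linear system is solved at any iteration.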

If this is right

  • The squared distance to the minimizer converges almost surely at O(log s / s^α) for step-size exponents α > 2/3.
  • The approximate FIM converges at almost-sure rates despite the added transport errors.
  • The limited-memory variant achieves the same rates with sub-quadratic storage.
  • The method applies directly to constrained problems such as positive-definite covariance estimation and orthogonal parameters in normalizing flows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar transport-based updates could be applied to other online matrix approximations on manifolds beyond the FIM.
  • The method may allow natural extension to non-Euclidean parameter spaces in deep learning models that already use Riemannian optimization.
  • Choosing manifolds with closed-form transport maps would minimize the error accumulation shown in the rates.
  • Testing on higher-dimensional manifolds could reveal whether the quadratic cost remains practical at scale.

Load-bearing premise

The transport operations between tangent spaces must be accurate enough that their accumulated errors do not invalidate the convergence rates, and the manifold must be regular enough for the stochastic approximation to converge.

What would settle it

A numerical counter-example on a simple manifold such as the positive-definite cone where large transport approximation error causes the iterates to diverge or fail to achieve the predicted rate.
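
A minimal sketch of how such a check could be scored in practice, assuming one has logged the geodesic distances d(θ(s), θ*) along a run; the function name, the burn-in fraction, and the decision to ignore the log s factor are illustrative choices, not taken from the paper.

    import numpy as np

    def empirical_decay_exponent(distances, burn_in_frac=0.2):
        """Fit a in d(theta_s, theta*)^2 ~ C / s^a from a logged run.

        If the fitted exponent sits clearly below the step-size exponent alpha
        used in the run, the predicted O(log s / s^alpha) rate is not being met.
        Ignoring the log s factor nudges the fitted exponent slightly below alpha.
        """
        d2 = np.asarray(distances, dtype=float) ** 2
        s = np.arange(1, len(d2) + 1)
        keep = s >= max(2, int(burn_in_frac * len(d2)))
        slope, _ = np.polyfit(np.log(s[keep]), np.log(d2[keep]), 1)
        return -slope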

Figures

Figures reproduced from arXiv: 2604.02969 by Dario Draca, Minh-Ngoc Tran, Takuo Matsubara.

Figure 1. NELBO versus iteration for the Bayesian logistic regression model.
Figure 2. Comparison of approximate/exact natural gradient algorithms on larger datasets.
Figure 3. Negative lower bound plots of the three VB methods.
read the original abstract

The natural gradient method is widely used in statistical optimization, but its standard formulation assumes a Euclidean parameter space. This paper proposes an inversion-free stochastic natural gradient method for probability distributions whose parameters lie on a Riemannian manifold. The manifold setting offers several advantages: one can implicitly enforce parameter constraints such as positive definiteness and orthogonality, ensure parameters are identifiable, or guarantee regularity properties of the objective like geodesic convexity. Building on an intrinsic formulation of the Fisher information matrix (FIM) on a manifold, our method maintains an online approximation of the inverse FIM, which is efficiently updated at quadratic cost using score vectors sampled at successive iterates. In the Riemannian setting, these score vectors belong to different tangent spaces and must be combined using transport operations. We prove almost-sure convergence rates of $O(\log{s}/s^\alpha)$ for the squared distance to the minimizer when the step size exponent $\alpha >2/3$. We also establish almost-sure rates for the approximate FIM, which now accumulates transport-based errors. A limited-memory variant of the algorithm with sub-quadratic storage complexity is proposed. Finally, we demonstrate the effectiveness of our method relative to its Euclidean counterparts on variational Bayes with Gaussian approximations and normalizing flows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces an inversion-free stochastic natural gradient descent algorithm for optimization over Riemannian manifolds, particularly for probability distributions. It maintains an online approximation to the inverse Fisher information matrix by updating with score vectors that are parallel-transported between tangent spaces at successive iterates. The central theoretical claims are almost-sure convergence rates of O(log s / s^α) for the squared distance to the minimizer when the step-size exponent α > 2/3, together with almost-sure rates for the approximate FIM, whose recursion explicitly accumulates transport-based errors. A limited-memory variant with sub-quadratic storage is proposed, and the method is evaluated on variational Bayes tasks using Gaussian approximations and normalizing flows.

Significance. If the claimed rates survive the accumulated transport errors, the work would be a useful extension of natural-gradient methods to manifold-constrained settings that arise in statistical machine learning. The inversion-free online update, explicit handling of transport operations, and limited-memory implementation address both computational and geometric constraints, while the almost-sure rates provide stronger guarantees than typical in-expectation analyses.

major comments (1)
  1. [Convergence analysis] Convergence analysis (proof of the almost-sure rate for squared distance): the argument must show that the summed transport discrepancy, which is O(s^{1-α}) when each parallel-transport error is O(‖x_{s+1}-x_s‖), remains dominated by the driving noise and step-size terms in the supermartingale inequality for all α > 2/3. An explicit bound on the per-step transport error (or its expectation) and its insertion into the distance recursion is required; without it the rate claim for α close to 2/3 is not yet verified.
minor comments (2)
  1. [Abstract] The abstract states that the approximate FIM 'accumulates transport-based errors' yet still yields rates; the precise order of these errors and the manifold regularity assumptions (e.g., bounded curvature, geodesic convexity) should be stated explicitly in the abstract or introduction.
  2. [Experiments] Experimental section: the choice of step-size exponent α and the concrete implementation of parallel transport (exact or approximate) should be reported for each experiment so that the claimed robustness to transport errors can be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We appreciate the positive assessment of the method's potential and address the major comment on the convergence analysis below. We will revise the paper to incorporate the requested explicit bounds and verifications.

read point-by-point responses
  1. Referee: [Convergence analysis] Convergence analysis (proof of the almost-sure rate for squared distance): the argument must show that the summed transport discrepancy, which is O(s^{1-α}) when each parallel-transport error is O(‖x_{s+1}-x_s‖), remains dominated by the driving noise and step-size terms in the supermartingale inequality for all α > 2/3. An explicit bound on the per-step transport error (or its expectation) and its insertion into the distance recursion is required; without it the rate claim for α close to 2/3 is not yet verified.

    Authors: We agree that an explicit treatment of the transport discrepancy is necessary to fully verify the almost-sure rate for α close to 2/3. In the revised manuscript we will add a supporting lemma that bounds the per-step parallel-transport error by O(‖x_{s+1}−x_s‖) (with the implicit constant depending only on sectional-curvature bounds and the local Lipschitz constant of the score map, both assumed finite in a neighborhood of the minimizer). This bound will be inserted directly into the supermartingale recursion for the squared geodesic distance. The resulting accumulated transport term is O(s^{1-α}); we will show that, for every α>2/3, this term is dominated by the driving step-size and martingale-noise contributions that already produce the claimed O(log s / s^α) rate. The expanded argument, including the precise comparison of exponents, will appear in the appendix of the revised version. revision: yes
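
For orientation, the supermartingale tool referred to here appears to be of the Robbins-Siegmund type (the theorem is invoked in the paper's appendix excerpts quoted further down); a generic statement, not the paper's exact recursion, reads: if V_s ≥ 0 and a_s, b_s, c_s ≥ 0 are F_s-measurable with

    $\mathbb{E}[V_{s+1}\mid\mathcal{F}_s] \le (1+a_s)\,V_s - b_s + c_s, \qquad \sum_s a_s < \infty, \quad \sum_s c_s < \infty \ \text{a.s.},$

then V_s converges almost surely and Σ_s b_s < ∞ a.s. With V_s the squared geodesic distance to the minimizer, the open question the referee raises is where the per-step transport error, of order ‖x_{s+1}-x_s‖ = O(τ_s) = O(s^{-α}) under bounded update norms, enters this recursion, and whether its contribution stays dominated for every α > 2/3.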

Circularity Check

0 steps flagged

Minor self-citation on background Riemannian tools; central convergence derivation remains independent

full rationale

The paper derives almost-sure rates O(log s / s^α) for α > 2/3 via supermartingale analysis on the squared geodesic distance, explicitly incorporating accumulated transport errors in the approximate FIM recursion. These steps rest on standard assumptions (geodesic convexity, bounded curvature for transport bounds) drawn from the Riemannian optimization literature rather than any self-defined quantity or fitted parameter. No equation reduces a claimed prediction to an input by construction, and any self-citations support only auxiliary geometric facts without bearing the load of the rate proof.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the existence of an intrinsic Fisher information matrix on the manifold, the feasibility of transport maps between tangent spaces at successive iterates, and standard stochastic-approximation assumptions on step sizes and noise.

free parameters (1)
  • step-size exponent α
    Must satisfy α > 2/3 for the stated almost-sure rate; chosen by the user to guarantee convergence.
axioms (2)
  • domain assumption The parameter space is a Riemannian manifold equipped with a well-defined Fisher information metric.
    Invoked to justify the intrinsic formulation and the need for transport operations.
  • ad hoc to paper Score vectors from different iterates can be transported without introducing errors that invalidate the convergence analysis.
    Required for the almost-sure rates to hold in the presence of manifold curvature.

pith-pipeline@v0.9.0 · 5524 in / 1442 out tokens · 37890 ms · 2026-05-13T18:12:12.978807+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

  1. [1]

    Riemannian Adaptive Optimization Methods

Absil, P.-A., Mahony, R. & Sepulchre, R. (2009), Optimization Algorithms on Matrix Manifolds, Princeton University Press. Absil, P.-A. & Malick, J. (2012), 'Projection-like retractions on matrix manifolds', SIAM Journal on Optimization 22(1), 135–158. Akyol, M. & Şahin, B. (2019), 'Conformal semi-invariant R...

  2. [2]

    & Sepulchre, R

Meyer, G., Bonnabel, S. & Sepulchre, R. (2011), Linear regression under fixed-rank constraints: a Riemannian approach, in '28th International Conference on Machine Learning'. Mishra, B., Meyer, G., Bonnabel, S. & Sepulchre, R. (2014), 'Fixed-rank matrix factorizations and Riemannian low-rank optimization', Computational Statistics 29(3), 591–621. Neste...

  3. [3]

    & Hansen, N

Ollivier, Y., Arnold, L., Auger, A. & Hansen, N. (2017), 'Information-geometric optimization algorithms: A unifying picture via invariance principles', Journal of Machine Learning Research 18(18), 1–65. Park, H., Amari, S.-I. & Fukumizu, K. (2000), 'Adaptive natural gradient learning algorithms for various stochastic models', Neural Networks 13(7), 755–764. Pennec, ...

  4. [4]

Tran, M.-N., Nott, D. J. & Kohn, R. (2017), 'Variational Bayes with intractable likelihood', Journal of Computational and Graphical Statistics 26(4), 873–882. Tripuraneni, N., Flammarion, N., Bach, F. & Jordan, M. I. (2018), Averaging stochastic gradient descent on Riemannian manifolds, in 'Conference On Learning Theory', PMLR, pp. 650–687. Villani, C. et al...

  5. [5]

Since Γ^R is isometric, we can evaluate the norm as ‖T_{v_x}[w]‖_{R_x(v)} = ‖G_x(v)[w]‖_x = ‖(G_x(v)[w] − w) + w‖_x. (8.7) The result follows via the ordinary and reversed triangle inequalities

We have the truncated Taylor expansion: G_x(v) = Id_x + (1/2) D^2 G_x(t*v)[v, v, ·], t* ∈ [0,1]. (8.5) Hence ‖T_{v_x}[w] − Γ^R_{v_x}[w]‖_{R_x(v)} = ‖G_x(v)[w] − w‖_x ≤ ‖D^2 G_x(t*v)‖_op ‖v‖_x^2 ‖w‖_x. (8.6) By the smoothness of Γ^R and R, let c < ∞ denote the supremum of ‖D^2 G_x(u)‖_op over the compact set {(x, u) ∈ TX : ‖u‖_x ≤ δ}. Since Γ^R is isometric, we can evaluate the norm as ‖T_{v_x}[w]‖_{R_x(v)} = ‖G_x(v)[w]‖_x = ‖(G_x(v)...

  6. [6]

    zero acceleration

The main result then follows by a Taylor series argument. Differentiating G along v, we have: d/dt G(tv)[w]|_{t=0} = (∇DR_x^{-1})[v, D exp_x(0)[w]] + DR_x^{-1}(x)[∇_v(D exp_x[w])] (8.12) = (∇DR_x^{-1})[v, w] + ∇_v(D exp_x[w]). (8.13) This terminology is standard in Riemannian geometry; see e.g. Do Carmo & Flaherty Francis (1992). Here ∇_v(D exp_x[w]) is the ordinary covariant deriv...

  7. [7]

An immediate consequence of the previous lemma is that T agrees with parallel transport up to second order terms

The Taylor expansion is thus G(v) = Id_x + O(‖v‖_x^2), where ‖v‖_x = ‖exp_x^{-1}(y)‖_x = d(x, y), which concludes the proof. An immediate consequence of the previous lemma is that T agrees with parallel transport up to second order terms. In the following lemmas, we define T_{x,y}[w] := DR_x(R_x^{-1}(y))[w], and let Γ_{x,y} denote the parallel transport along the geodesic connec...

  8. [8]

B.3 Claim: composite transport behaves like an isometry for large k, s

… ≤ [ (1 + O(‖v_s‖_{θ(s−1)}^2)) V_s + 4(C′_0 + C′_1 W_s^2)/(s log^{1+δ} s) ]. (8.38) By the Robbins-Siegmund theorem, V_s = O(1) and thus ‖M_s‖_{θ(s−1)}^2 = O(log^{1+δ} s / s) a.s. B.3 Claim: composite transport behaves like an isometry for large k, s. Let Γ^R_{s,s−1} denote parallel transport along γ(t) = R_{θ(s)}(t R_{θ(s)}^{-1}(θ(s−1))) from t = 0 to 1. Write T_{[s,k]}^{-1} = Γ^{R,−1}_{s,s−1}(Γ^R...

  9. [9]

via a Taylor series that the coefficient of V_s is ≤ 1 eventually, hence ‖M^3_{s+1}‖_{θ*} = O(log^{2+δ} s / s^{2α−1}), a.s.

It follows that E[V_{s+1} | F_s] ≤ (1 − c̃/s^α)(1 + 1/s)^{2α−1} V_s + O(1/(s log^{1+δ} s)). (8.94) One can show e.g. via a Taylor series that the coefficient of V_s is ≤ 1 eventually, hence ‖M^3_{s+1}‖_{θ*} = O(log^{2+δ} s / s^{2α−1}), a.s. (8.95) D.6 Claim: ‖δ_s‖_{θ*} = O(‖∆_s‖_{θ*}^2). Expanding the definition of δ_s: δ_s = Γ^{-1}_{*,s}∇L(θ(s)) − ∇^2 L(θ*) exp^{-1}_{θ*}(θ(s)) = (a) + ∇^2 L(θ*)[exp^{-1}_{θ*}(θ(...

  10. [10]

    sliding window

(8.107) The final term has a finite sum a.s., therefore by the Robbins-Siegmund theorem, Σ_{s=2}^∞ V_s = Σ_{s=2}^∞ (s^{2α−1}/log^{1+δ} s) ‖v_s‖_{θ(s−1)}^2 < ∞, a.s. (8.108) Let c_s := s^γ/(log s)^{1+δ} for some δ > 0 and γ < 2α−1; by Cauchy-Schwarz, Σ_{s=k}^∞ d(θ(s), θ(s+1)) d(θ(s), θ*) ≤ (Σ_{s=k}^∞ c_s d(θ(s), θ(s+1))^2)^{1/2} (Σ_{s=k}^∞ c_s^{-1} d(θ(s), θ*)^2)^{1/2}. (8.109) From Theorem 8.2 we have d(θ(s), θ(s...

  11. [11]

    Takatsu (2011), Malagò et al

There are many references on the latter; see e.g. Takatsu (2011), Malagò et al. (2018), Bhatia et al. (2019). Then, we provide some additional details on our experimental methodology. I.1 The Fisher and Bures-Wasserstein Metrics. The space of non-degenerate Gaussian distributions on R^d can be viewed as a smooth manifold M, parametrized by the mean and cova...

  12. [12]

    (2018), we include it here for completeness

:= (m_2 − m_1, Σ_1^{-1} # Σ_2 − I). (8.171) This is also provided in Malagò et al. (2018); we include it here for completeness. Letting log_{Σ_1}(Σ

  13. [13]

I.3 Algorithm 1 – Representing, Updating, and Transporting H_s^{-1}. Let ℓ(x) denote the log-likelihood of N(µ, Σ). The two components of the score are then ∇_µ ℓ(x) = Σ^{-1}(x − µ), ∇_Σ ℓ(x) = (1/2)Σ^{-1}(x − µ)(x − µ)^⊤ Σ^{-1} − (1/2)Σ^{-1}. (8.173) The approximate Fisher information H_s is decomposed into parts which act separately on the mean and covariance parameters; we denote thes...

  14. [14]

is R_X(U) = (I − (1/2)W_U)^{-1}(I + (1/2)W_U) X. (8.198) The associated isometric vector transport (Zhu 2017, Lemma

  15. [15]

is T_U(V) = (I − (1/2)W_U)^{-1}(I + (1/2)W_U) V, V ∈ T_X St(p, n), (8.199) which maps T_X St(p, n) → T_{R_X(U)} St(p, n). J.2 Algorithm Details. The variational parameter θ = (W_1, W_2, b_1, b_2) belongs to the product manifold M = St(d, d) × St(d, d) × R^d × R^d. (8.200) Closed-form expressions for the Euclidean score ∇^E_θ log q_θ(y) and the negative ELBO gradient ∇^E_θ L(θ) are derived in Appendix A.5 of G...

  16. [16]

Riemannian Natural Gradient: This method follows the limited-memory variant of Algorithm 1 described in Remark 4.3 to move and update H_s^{-1}

Riemannian Stochastic Gradient Descent: This is given by θ(s+1) = R_{θ(s)}(−τ_{s+1} ∇_θ L̂(θ(s))), (8.202) where ∇_θ L̂(θ(s)) = Proj_{θ(s)}[∇^E_θ L̂(θ(s))] is the stochastic Riemannian NELBO gradient. Riemannian Natural Gradient: This method follows the limited-memory variant of Algorithm 1 described in Remark 4.3 to move and update H_s^{-1}. The update is θ(s+1) = R_{θ(s)}(−τ_{s...