Curvature-Aligned Probing for Local Loss-Landscape Stabilization

Andrey Grabovoy; Nikita Kiselev

arxiv: 2604.14870 · v1 · submitted 2026-04-16 · 💻 cs.LG

Pith reviewed 2026-05-10 11:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords loss landscape stabilization · Hessian eigenspace · curvature probing · quadratic model · subspace estimation · mean-squared rate · neural network training

The pith

A curvature-aligned probe in the top Hessian eigenspace preserves the full-space stabilization rate while depending only on the small subspace dimension D.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper recasts the problem of measuring local loss-landscape stabilization under growing sample counts as an observational task. It introduces a family of criteria that combine an aggregation order with a probing distribution over parameter directions. Within this family the authors single out the curvature-aligned criterion that restricts the probe to the top-D eigenvectors of the empirical Hessian at a trained point. From a local quadratic model of the loss they prove that this restricted probe retains the O(k^{-2}) mean-squared convergence rate of the unrestricted full-space version, but replaces dependence on the ambient parameter count with dependence on the much smaller D. They also supply closed-form spectral expressions, scalable estimators based on Hessian-vector products and subspace Monte Carlo, and numerical evidence that a tiny-D probe already reproduces the full-space signal on a decoder-only transformer.

Core claim

We introduce the curvature-aligned criterion Δ₂^(D) that probes the loss increment field exclusively in the top-D eigenspace of the empirical Hessian near a trained solution. Solely from a local quadratic model we prove that Δ₂^(D) preserves the O(k^{-2}) mean-squared rate of the full-space criterion while replacing ambient-dimension curvature dependence with dependence on the subspace dimension D. A corollary supplies a closed-form spectral expression for the criterion, and a proposition shows that the top-D eigenspace is extremal inside the eigenspace-aligned family. Scalable estimators follow from Hessian-vector products, subspace Monte Carlo sampling, and a Gaussian-moment proxy.
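The estimators named above rely on Hessian-vector products rather than a stored Hessian. A minimal sketch of how a top-D eigenspace could be obtained that way is given below, using orthogonal (block power) iteration followed by a Rayleigh-Ritz step. This is an illustration under stated assumptions, not the authors' implementation: the function `top_eigenspace`, the toy matrix `H`, and the callable `hvp` are all hypothetical, and in practice `hvp` would be supplied by automatic differentiation. The iteration finds the largest-magnitude eigenvalues, which coincide with the largest eigenvalues only when the Hessian is positive semidefinite.

```python
import numpy as np

def top_eigenspace(hvp, dim, D, iters=300, seed=0):
    """Approximate the top-D eigenpairs using only Hessian-vector products."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, D)))  # random orthonormal start
    for _ in range(iters):
        Z = np.column_stack([hvp(Q[:, j]) for j in range(D)])  # apply H to each column
        Q, _ = np.linalg.qr(Z)                                  # re-orthonormalize
    # Rayleigh-Ritz: diagonalize H restricted to span(Q).
    HQ = np.column_stack([hvp(Q[:, j]) for j in range(D)])
    T = Q.T @ HQ
    evals, V = np.linalg.eigh(0.5 * (T + T.T))
    order = np.argsort(evals)[::-1]                             # descending order
    return evals[order], Q @ V[:, order]

# Toy stand-in for an empirical Hessian: three dominant directions, flat elsewhere.
rng = np.random.default_rng(1)
dim = 50
spectrum = np.concatenate([[10.0, 5.0, 2.0], 0.1 * rng.random(dim - 3)])
R, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
H = (R * spectrum) @ R.T            # R diag(spectrum) R^T
hvp = lambda v: H @ v               # hypothetical; replace with an autodiff HVP

evals, U = top_eigenspace(hvp, dim, D=3)
```

Only `D` Hessian-vector products are needed per iteration, so memory scales with `dim * D` instead of `dim * dim`. A Lanczos-type solver would typically converge in fewer products; block power iteration is used here only because it is short.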

What carries the argument

The curvature-aligned criterion Δ₂^(D), which aggregates the squared loss increments observed along the top-D eigenvectors of the empirical Hessian.
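The aggregation described above can be sketched as a subspace Monte Carlo average: draw Gaussian coefficients in the top-D basis, map them back to parameter space, and average the squared loss increments. The sketch below assumes that the criterion is the mean of squared increments under a Gaussian probe of scale `sigma`, which is one reading of the summary and not a definition taken from the paper. The names `subspace_mc`, `increment`, `A`, and `w_star` are hypothetical, and the quadratic increment is a toy stand-in.

```python
import numpy as np

def subspace_mc(increment, w_star, U, sigma, num_samples, rng):
    """Mean squared increment under a Gaussian probe restricted to span(U)."""
    z = sigma * rng.standard_normal((num_samples, U.shape[1]))  # coefficients in subspace
    vals = np.array([increment(w_star + U @ zi) for zi in z])   # increments at probed points
    return float(np.mean(vals**2))

rng = np.random.default_rng(0)
dim, D, sigma = 20, 3, 0.1

# Toy positive semidefinite curvature and its top-D eigenvectors.
G = rng.standard_normal((dim, dim))
H = G @ G.T / dim
_, vecs = np.linalg.eigh(H)         # eigh returns eigenvalues in ascending order
U = vecs[:, -D:]                    # last D columns span the top-D eigenspace

# Hypothetical quadratic increment field around w_star (assumption, for illustration).
B = rng.standard_normal((dim, dim))
A = 0.5 * (B + B.T) / dim
w_star = rng.standard_normal(dim)
increment = lambda w: 0.5 * (w - w_star) @ A @ (w - w_star)

estimate = subspace_mc(increment, w_star, U, sigma, 20_000, rng)
```

Each sample costs one evaluation of the increment, and the number of random coordinates per sample is `D` rather than the ambient dimension.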

If this is right

The stabilization signal can be tracked with cost that scales with D rather than the full parameter count.
Only Hessian-vector products and subspace sampling are required; no full Hessian storage is needed.
The top-D eigenspace is extremal within the eigenspace-aligned family of probes, at least in the idealized quadratic regime in which the proposition is proved.
A Gaussian-moment proxy yields the criterion in closed form once the top-D spectrum is known.
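To illustrate why a Gaussian-moment proxy can be closed-form, the sketch below uses the standard identity E[(zᵀMz)²] = tr(M)² + 2 tr(M²) for symmetric M and z ~ N(0, I), which depends on M only through its eigenvalues. It assumes the increment restricted to the subspace is the quadratic form ½ zᵀMz; that modelling choice, the matrix `M`, and the function name `gaussian_moment_from_spectrum` are assumptions made for illustration and may differ from the paper's exact estimator.

```python
import numpy as np

def gaussian_moment_from_spectrum(mu, sigma):
    """E[(0.5 * z^T M z)^2] for z ~ N(0, sigma^2 I), given the eigenvalues mu of M."""
    mu = np.asarray(mu, dtype=float)
    # tr(M) = sum(mu) and tr(M^2) = sum(mu^2), so only the spectrum is needed.
    return 0.25 * sigma**4 * (mu.sum() ** 2 + 2.0 * np.sum(mu**2))

rng = np.random.default_rng(0)
D, sigma = 4, 0.1
B = rng.standard_normal((D, D))
M = 0.5 * (B + B.T)                 # hypothetical projected increment matrix
mu = np.linalg.eigvalsh(M)

closed_form = gaussian_moment_from_spectrum(mu, sigma)

# Direct Monte Carlo estimate of the same quantity, for comparison.
z = sigma * rng.standard_normal((200_000, D))
q = 0.5 * np.einsum("si,ij,sj->s", z, M, z)
monte_carlo = float(np.mean(q**2))
rel_gap = abs(monte_carlo - closed_form) / closed_form
```

The closed form needs `D` eigenvalues and a handful of arithmetic operations, whereas the Monte Carlo estimate needs many samples to reach a comparable accuracy.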

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Near minima the effective curvature rank of neural losses appears far lower than the ambient dimension, allowing reliable monitoring with modest D.
The same subspace could be reused for other local diagnostics such as sharpness or flatness measures.
If the quadratic regime extends farther than expected, the method may serve as a lightweight regularizer during late-stage training.

Load-bearing premise

The loss surface near a trained solution is well approximated by a quadratic whose dominant deformations are captured by a low-dimensional subspace spanned by the largest Hessian eigenvalues.
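One way to probe this premise is to measure how far the exact loss departs from its second-order Taylor model as the perturbation scale grows. The sketch below does this on a hypothetical toy loss with a quartic correction term; the functions `loss` and `quadratic_model`, the curvatures `a`, and the coefficient `c` are assumptions for illustration and are unrelated to the transformer studied in the paper.

```python
import numpy as np

def loss(delta, a, c):
    """Toy loss around a minimum at 0: quadratic part plus a quartic correction."""
    return 0.5 * np.sum(a * delta**2, axis=-1) + c * np.sum(delta**4, axis=-1)

def quadratic_model(delta, a):
    """Second-order Taylor model at 0 (the gradient vanishes, the Hessian is diag(a))."""
    return 0.5 * np.sum(a * delta**2, axis=-1)

rng = np.random.default_rng(0)
dim, c = 100, 1.0
a = rng.uniform(1.0, 10.0, size=dim)        # hypothetical positive curvatures
z = rng.standard_normal((256, dim))         # shared unit-scale probe directions

sigmas = [1e-3, 1e-2, 1e-1, 1.0]
rel_err = []
for s in sigmas:
    d = s * z
    exact = loss(d, a, c)
    gap = np.abs(exact - quadratic_model(d, a)) / np.abs(exact)
    rel_err.append(float(np.mean(gap)))     # mean relative error at scale s
```

On this toy loss the relative error grows with the scale, so a threshold on `rel_err` gives one possible way to delimit the range of scales over which a quadratic analysis is trusted.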

What would settle it

If, on trained neural networks, the mean-squared deviation between Δ₂^(D) and the full-space criterion fails to decay as O(k^{-2}) or remains visibly larger than numerical noise even for moderate D, the rate-preservation claim is falsified.
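A decay-rate check of this kind can be sketched as a log-log slope fit. The example below simulates a toy setting in which the empirical loss is a running mean of per-sample quadratics, so the increment between k and k+1 samples is itself quadratic, and then fits the slope of the mean-squared criterion against k. Everything here is an assumption made for illustration: the per-sample Hessians `outer(a_i, a_i)`, the Gaussian probe of scale `sigma`, and the use of the closed-form second moment are not taken from the paper, and the data are synthetic.

```python
import numpy as np

def gaussian_moment(M, sigma):
    """E[(0.5 * z^T M z)^2] for z ~ N(0, sigma^2 I) and symmetric M."""
    return 0.25 * sigma**4 * (np.trace(M) ** 2 + 2.0 * np.sum(M * M))

rng = np.random.default_rng(0)
dim, k_max, sigma = 5, 2000, 0.1
a = rng.standard_normal((k_max + 1, dim))   # hypothetical per-sample data

H = np.zeros((dim, dim))                    # running mean of per-sample Hessians
ks, crit = [], []
for k in range(1, k_max + 1):
    H = H + (np.outer(a[k - 1], a[k - 1]) - H) / k   # H now equals H_k
    if k >= 10:
        # H_{k+1} - H_k = (A_{k+1} - H_k) / (k + 1) for a running mean.
        M = (np.outer(a[k], a[k]) - H) / (k + 1)
        ks.append(k)
        crit.append(gaussian_moment(M, sigma))

# Least-squares slope of log(criterion) against log(k).
slope, intercept = np.polyfit(np.log(ks), np.log(crit), 1)
```

A fitted slope close to -2 is consistent with an O(k^{-2}) mean-squared rate in this toy model; a markedly shallower slope on real measurements would be the kind of evidence the paragraph above describes.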

Figures

Figures reproduced from arXiv: 2604.14870 by Andrey Grabovoy, Nikita Kiselev.

**Figure 1.** Local stabilization as an observational problem. Existing local criteria differ not only in aggregation order, but also in how they probe the increment field. Our criterion Δ₂^(D) restricts probing to the principal Hessian subspace spanned by the top-D curvature directions. Garipov et al., 2018]. These observations suggest that how one probes local deformation should be part of the definition of stabiliza…

**Figure 2.** Decay of stabilization criteria with sample size. Comparison of Δ₁, Δ₂, and Δ₂^(D) as functions of k. Criterion decay under sample growth. We first test whether the criteria from Sections 3–4 exhibit the predicted stabilization trend as the effective sample size grows. To do so, we evaluate the pointwise criterion Δ₁, the isotropic mean-squared criterion Δ₂, and the curvature-aware subspace criterion Δ (…

**Figure 3.** Subspace criterion relative to the full-space criterion. Ratio Δ₂^(D)/Δ₂ across sample size k, for several dimensions D and scales σ. Subspace versus full-space criterion. We next test whether the curvature-aligned subspace criterion preserves the full-space mean-squared signal. For several values of D and σ, we track the ratio Δ₂^(D)/Δ₂ as a function of k.

**Figure 4.** Relative error of the quadratic Taylor approximation versus perturbation scale σ (mean and standard deviation across seeds, nanochat d6, step 3500).

**Figure 5.** Both Monte Carlo estimators converge toward the GM value as S increases.

**Figure 6.** Relative error |Δ̂_direct − GM| / (|Δ̂_direct| + ε) over subspace dimension D and perturbation scale σ. The top-D principal-curvature subspace is motivated by empirical anisotropy and supported by Proposition 3, but this extremality result holds only in an idealized simultaneously-diagonalizable quadratic regime. More general non-quadratic or rapidly drifting regimes may favor adaptive subspaces, and Assum…

Original abstract

Local loss-landscape stabilization under sample growth is typically measured either pointwise or through isotropic averaging in the full parameter space. Despite practical value, both choices probe directions that contribute little to the dominant local deformation of strongly anisotropic neural landscapes. We recast stabilization as an observational problem and introduce a unified family of criteria parameterized by an aggregation order and a probing distribution; within this family we propose a curvature-aligned criterion $\Delta_2^{(D)}$ that probes the loss increment field in the top-$D$ eigenspace of the empirical Hessian near a trained solution. Solely from a local quadratic model, we prove that $\Delta_2^{(D)}$ preserves the $O(k^{-2})$ mean-squared rate of the full-space criterion while replacing ambient-dimension curvature dependence with dependence on the subspace dimension $D$; a corollary gives a closed-form spectral expression and a proposition identifies the top-$D$ eigenspace as extremal within the eigenspace-aligned family. We also derive scalable estimators based on Hessian-vector products, subspace Monte Carlo, and a closed-form Gaussian-moment proxy. On a decoder-only transformer, a curvature-aligned probe occupying a tiny fraction of parameter space already reproduces the full-space mean-squared signal to within numerical noise throughout the validated local regime, and the closed-form estimator is orders of magnitude faster than direct Monte Carlo after subspace construction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper proves that a curvature-aligned probe in the top-D Hessian eigenspace preserves the O(k^{-2}) mean-squared rate of full-space stabilization criteria under a local quadratic model, with closed-form expressions and fast estimators that match full-space signals on a transformer.

The central result is a self-contained derivation: starting from a quadratic loss model, their Δ₂^(D) criterion keeps the same O(k^{-2}) rate as the isotropic full-space version while depending only on subspace dimension D rather than ambient parameter count. They also supply a spectral closed form, an extremality proposition for the top-D eigenspace within their family, and three scalable estimators (Hessian-vector products, subspace Monte Carlo, and a Gaussian-moment proxy). The transformer experiment shows the low-dimensional probe reproduces the full-space mean-squared signal to numerical noise inside the local regime, and the closed-form version is much faster after subspace setup.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces a unified family of criteria for measuring local loss-landscape stabilization parameterized by aggregation order and probing distribution. It proposes the curvature-aligned criterion Δ₂^(D), which probes the loss increment field in the top-D eigenspace of the empirical Hessian near a trained solution. Solely under a local quadratic loss model, the paper proves that Δ₂^(D) preserves the O(k^{-2}) mean-squared rate of the full-space isotropic criterion while replacing ambient-dimension curvature dependence with explicit dependence on subspace dimension D. It supplies a closed-form spectral corollary, an extremality proposition within the eigenspace-aligned family, scalable estimators (Hessian-vector products, subspace Monte Carlo, Gaussian-moment proxy), and empirical validation on a decoder-only transformer showing numerical agreement with the full-space signal inside the validated local regime.

Significance. If the central derivation holds, the work supplies a principled, dimension-reduced probe that retains the convergence rate of full-space stabilization measures while focusing on dominant curvature directions in anisotropic neural landscapes. The self-contained quadratic-model derivation, closed-form estimators, and falsifiable empirical prediction (reproduction of full-space signal to numerical noise) constitute clear strengths. This could enable more scalable analysis of local geometry and stabilization in large models where ambient-dimension computations are prohibitive.

minor comments (3)

[Abstract] The symbol k is used in the rate statement without prior definition; add a parenthetical clarifying that k indexes sample size or iteration count in the stabilization criterion.
[§3] Theoretical results: the transition from the full-space isotropic baseline to the curvature-aligned family could include an explicit one-line reminder of the baseline expression before stating the preservation result, to aid readers.
[Experiments] The choice of D and the precise construction of the top-D eigenspace (e.g., via power iteration or Lanczos) should be stated with a short implementation note or a reference to the estimator subsection.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and detailed summary of our work, as well as the recognition of its potential to enable scalable analysis of local geometry in large models. We are pleased that the central derivation, closed-form estimators, and empirical predictions are viewed as strengths. As no specific major comments were raised in the report, we have no point-by-point revisions to address at this time.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from quadratic model

The central result is a direct mathematical proof that, under an exact local quadratic loss model, the curvature-aligned criterion Δ₂^(D) retains the O(k^{-2}) mean-squared rate of the full-space isotropic probe while replacing ambient-dimension dependence with explicit D-dependence. The derivation begins from the stated quadratic assumption, supplies a closed-form spectral corollary and an extremality proposition, and does not reduce any claimed rate or quantity to a fitted parameter, self-citation, or definitional tautology. Scalable estimators and empirical checks on the transformer are presented separately from the theoretical preservation result and do not feed back into it. No load-bearing step collapses to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on a local quadratic approximation of the loss and the choice of subspace dimension D; the new criterion itself is introduced without independent empirical grounding outside the paper.

free parameters (1)

D
Subspace dimension that replaces full ambient dimension in the rate bound; selected by the user.

axioms (1)

domain assumption: Loss near a trained solution admits a local quadratic approximation whose curvature is captured by the empirical Hessian
Invoked to prove preservation of the O(k^{-2}) mean-squared rate for the subspace probe.

invented entities (1)

curvature-aligned criterion Δ₂^(D) · no independent evidence
purpose: Probes loss increment field restricted to top-D Hessian eigenspace
Newly defined member of the proposed family of stabilization criteria.

pith-pipeline@v0.9.0 · 5538 in / 1297 out tokens · 34015 ms · 2026-05-10T11:50:23.955797+00:00 · methodology
