
arxiv: 2606.30226 · v1 · pith:GDYOGAJOnew · submitted 2026-06-29 · 💻 cs.LG

Characterizing Optimizer-Dependent Training Dynamics Through Hessian Eigenvector Displacement and Localization

Pith reviewed 2026-06-30 07:51 UTC · model grok-4.3

classification 💻 cs.LG
keywords Hessian eigenvectors · optimizer dynamics · SGD · Adam · neural network training · eigenvector displacement · localization · inverse participation ratio

The pith

SGD stabilizes leading Hessian eigenvectors over training while Adam drives ongoing reorganization and localization to a small parameter subset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how the leading eigenvectors of the Hessian matrix, which point to the directions of greatest loss curvature, change as multilayer perceptrons train on classification tasks. It quantifies these changes with two statistics: eigenvector displacement across training steps and localization through the inverse participation ratio, both compared against a random null model generated by the network architecture alone. Results show that SGD produces progressively more stable leading curvature directions, whereas Adam produces stronger ongoing reorganization of those directions together with a localization effect in which a few parameters dominate the curvature. A sympathetic reader would care because these eigenvector-level statistics reveal optimizer differences that eigenvalue magnitudes alone do not capture and therefore offer a finer-grained view of how different optimizers shape training trajectories.
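The page does not reproduce the paper's Equation 3, so the exact displacement formula is not known here. As a minimal sketch, assume the displacement between two unit eigenvectors is their squared Euclidean distance after removing the sign ambiguity, Δ = 2(1 − |⟨v(t_w), v(t_w + t)⟩|). The function names and the checkpoint array layout below are hypothetical.

```python
import numpy as np

def eigvec_displacement(v_a, v_b):
    """Assumed definition: squared distance between sign-aligned unit vectors.

    Eigenvectors are defined only up to sign, so the absolute overlap is used.
    """
    v_a = np.asarray(v_a, dtype=float)
    v_b = np.asarray(v_b, dtype=float)
    v_a = v_a / np.linalg.norm(v_a)
    v_b = v_b / np.linalg.norm(v_b)
    overlap = min(abs(float(np.dot(v_a, v_b))), 1.0)  # clip rounding overshoot
    return 2.0 * (1.0 - overlap)

def two_time_displacement(eigvecs, t_w):
    """eigvecs has shape (T, N): row k is the leading eigenvector at checkpoint k.

    Returns the displacement from checkpoint t_w to every checkpoint t_w + t, t >= 0.
    """
    ref = eigvecs[t_w]
    return np.array([eigvec_displacement(ref, v) for v in eigvecs[t_w:]])

# Toy trajectory (hypothetical data): a fixed direction plus growing noise.
rng = np.random.default_rng(0)
base = rng.standard_normal(50)
eigvecs = np.stack([base + 0.1 * k * rng.standard_normal(50) for k in range(5)])
deltas = two_time_displacement(eigvecs, t_w=1)
```

Under this assumed definition the statistic is 0 for identical directions and 2 for orthogonal ones, so a curve that stays near 0 indicates a stable leading direction.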

Core claim

By tracking the displacement of leading Hessian eigenvectors and their localization via the inverse participation ratio against an architecture-induced random null model, the work establishes that SGD leads to progressively more stable leading curvature directions during training of multilayer perceptrons on classification, while Adam exhibits substantially stronger reorganization of eigenvectors throughout training together with a localization phenomenon in which a small subset of parameters contributes disproportionately to the leading curvature directions.

What carries the argument

Displacement of leading Hessian eigenvectors over training steps and their localization measured by the inverse participation ratio, benchmarked against a random null model of the Hessian induced by the architecture.
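For a unit-norm vector v, the inverse participation ratio has the standard form IPR = Σ_i v_i^4. It equals 1 when a single coordinate carries all the weight and 1/N when the weight is spread uniformly over N coordinates. Whether the paper uses exactly this normalization is an assumption; the sketch below only illustrates the standard definition.

```python
import numpy as np

def inverse_participation_ratio(v):
    """IPR = sum_i v_i^4 for the unit-normalized vector v."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    return float(np.sum(v ** 4))

n = 100

# Fully localized: one parameter carries the whole eigenvector.
localized = np.zeros(n)
localized[0] = 1.0

# Fully delocalized: every parameter contributes equally.
delocalized = np.ones(n)

ipr_localized = inverse_participation_ratio(localized)      # 1 up to rounding
ipr_delocalized = inverse_participation_ratio(delocalized)  # 1/n up to rounding
```

A rising IPR over training would therefore indicate that fewer parameters dominate the leading curvature direction, which is the pattern the page reports for Adam.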

If this is right

  • Optimizer selection affects not only loss decrease rate but also the temporal stability and spatial concentration of the dominant curvature directions.
  • Adam's continued eigenvector reorganization implies that its adaptive per-parameter steps keep altering which directions dominate curvature even late in training.
  • Localization under Adam indicates that curvature becomes concentrated on fewer effective parameters, potentially changing the effective dimensionality of the optimization problem.
  • Eigenvector dynamics supply a diagnostic that distinguishes training trajectories beyond what loss curves or eigenvalue spectra alone reveal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If stable eigenvectors correlate with flatter regions, the SGD pattern could help explain its often-reported generalization advantage over Adam.
  • The same displacement and localization statistics could be applied to other architectures to test whether the optimizer contrast persists beyond MLPs.
  • If the null model under-corrects for architecture effects, the reported optimizer-specific signals would shrink and the localization claim would need re-examination on matched architectures.

Load-bearing premise

The random null model of the Hessian induced by the architecture supplies a baseline that lets observed displacement and localization be attributed to the optimizer rather than to the network architecture itself.

What would settle it

Re-running the same MLP training experiments and finding that the time series of eigenvector displacement and inverse-participation-ratio values under SGD and Adam are statistically indistinguishable from each other and from the architecture null model.
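One way such a check could be run is a two-sample permutation test on a per-run summary statistic, for example the final IPR of each seed. This is a sketch of a generic test, not a procedure taken from the paper; the function name, the choice of the mean as the test statistic, and the sample values are all assumptions.

```python
import numpy as np

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an estimated p-value for the null hypothesis that the two
    samples come from the same distribution.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = abs(perm[:len(a)].mean() - perm[len(a):].mean())
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)

# Hypothetical per-seed statistics for two optimizers (made-up numbers).
stat_opt_a = [0.10, 0.11, 0.12, 0.13, 0.14]
stat_opt_b = [0.90, 0.91, 0.92, 0.93, 0.94]
p_value = permutation_test(stat_opt_a, stat_opt_b)
```

With only a handful of seeds per optimizer the smallest attainable p-value is bounded by the number of distinct splits, so more runs would be needed for a sharp conclusion.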

Figures

Figures reproduced from arXiv: 2606.30226 by Marcelina Marjankowska, Paolo Barucca, Valerio Modugno.

Figure 1. Two-time mean square displacement, ∆(t_w, t_w + t), as defined in Equation 3. Each curve t_w corresponds to the leading eigenvector at time t_w and its change after waiting t steps. The three plots correspond to the three optimizers. Red dashed line indicates the random baseline.
[Plot text: panels MNIST and FashionMNIST; axes "update alignment" and "gradient alignment".]
Figure 2. Dot product of normalized gradient and update with …
Figure 3. Inverse Participation Ratio of the leading eigenvector throughout training. Mean and min …
Figure 4. Training loss (log-scale). Mean and min-max over 5 random initializations. Adam …
Figure 5. Test accuracy across multiple runs, where each point in the scatter plot corresponds to …
Figure 6. Top-3 Hessian eigenvalues throughout training. Mean and min-max shown over random …
Figure 7. Convergence of the top-3 eigenvectors to the final minimum basin. This is dot product of …
Figure 8. Comparison of best minima (over 5 runs) between optimizers. We consider the test+train …
Figure 9. Training on the FashionMNIST dataset. Two-time mean square displacement, …
Figure 10. Two-time weight displacement metric ∆_weights(t_w, t_w + t) as defined in [6], without normalization by gradient noise.
Figure 11. The alignment of weight change with the leading eigenvector direction. More precisely, …
Figure 12. Example of raw alignment of (normalized) gradient and update with the top curvature …
Figure 13. Alignment of (normalized) gradient and update with the top curvature directions. Points …
Figure 14. Alignment of (normalized) gradient and update with the top curvature directions. Points …
Figure 15. Inverse participation ratio (IPR) of the top 15 eigenvectors throughout training for a …
Figure 16. Comparison of the reshaped leading eigenvector at the end of training on MNIST. The …
Figure 17. Dot product of normalized gradient and update with …
Figure 18. Inverse participation ratio (IPR) of the leading eigenvector throughout training for vary…
Figure 19. Inverse participation ratio (IPR) of the leading eigenvector throughout training for vary…
Figure 20. Two-time mean square displacement, ∆^(i)(t_w, t_w + t), as defined in Equation 3. Each curve t_w corresponds to the leading eigenvector at time t_w and its diffusion after waiting t steps. Number of training steps adjusted to be comparable between different batch size runs. Red dashed line indicates the random baseline. Observations from the main text still hold, with the exception of full-batch setting (batch …
Figure 21. Two-time mean square displacement, ∆^(i)(t_w, t_w + t), as defined in Equation 3. Each curve t_w corresponds to the leading eigenvector at time t_w and its diffusion after waiting t steps. Red dashed line indicates the random baseline. The learning rate doesn’t impact the differences in eigenvector displacement across optimizers.
Original abstract

Hessian spectral properties are a standard tool in analysing neural-network training, with eigenvalues linked to sharpness, generalization, and optimization dynamics. Eigenvalues quantify curvature magnitude, while eigenvectors identify which parameters generate that curvature. In this work, we study how the leading Hessian eigenvectors evolve during training and how they affect the learning trajectories. We track the training dynamics of multilayer perceptrons on a classification problem and measure eigenvector dynamics through two complementary statistics: (i) displacement over time, inspired by analyses of glassy systems, and (ii) localization via the inverse participation ratio. The metrics are compared against a random null model of the Hessian induced by the architecture. Our results reveal clear optimizer-dependent behaviour. SGD leads to progressively more stable leading curvature directions, while Adam exhibits substantially stronger reorganization of eigenvectors throughout training. We also observe a localization phenomenon under Adam, where a small subset of parameters contributes disproportionately to the leading curvature directions. These results suggest that Hessian eigenvector dynamics capture key differences in optimizer behaviour and the resulting training trajectories.
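The page does not say how the leading eigenvectors are computed. The reference list includes Lanczos [25] and Pearlmutter's Hessian-vector product [30], which suggests a matrix-free iterative method, but that is an inference. The sketch below swaps in two simpler techniques: power iteration instead of Lanczos, and a central finite difference of the gradient instead of an exact Hessian-vector product. The toy quadratic loss and all names are hypothetical.

```python
import numpy as np

def hvp_finite_diff(grad_fn, theta, v, eps=1e-4):
    """Approximate H @ v by a central difference of the gradient."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

def leading_eigvec(grad_fn, theta, n_iter=200, seed=0):
    """Power iteration: converges to the eigenvector whose eigenvalue has
    the largest magnitude (which may be negative for a non-convex loss)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        hv = hvp_finite_diff(grad_fn, theta, v)
        v = hv / np.linalg.norm(hv)
    # Rayleigh quotient gives the matching eigenvalue estimate.
    eigval = float(v @ hvp_finite_diff(grad_fn, theta, v))
    return eigval, v

# Toy loss L(theta) = 0.5 * theta^T A theta, so the Hessian is A itself.
A = np.diag([4.0, 1.0, 0.5])

def grad_fn(theta):
    return A @ theta

theta0 = np.ones(3)
eigval, eigvec = leading_eigvec(grad_fn, theta0)
```

Because only gradient evaluations are needed, the full Hessian is never formed, which is what makes this family of methods usable at network scale.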

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes the evolution of leading Hessian eigenvectors during training of multilayer perceptrons on classification tasks, using displacement (inspired by glassy systems) and inverse participation ratio to quantify dynamics. It reports optimizer-dependent patterns when compared to a random null model of the Hessian induced by the architecture: SGD produces progressively more stable leading curvature directions, while Adam exhibits stronger eigenvector reorganization throughout training and a localization effect in which a small subset of parameters dominates the leading curvature directions.

Significance. If the central attribution to optimizers holds after proper validation of the baseline, the work would provide a useful spectral characterization of how SGD versus Adam shape loss-landscape geometry, extending beyond eigenvalue magnitudes to eigenvector stability and localization. The complementary metrics and explicit null-model comparison constitute a concrete, falsifiable approach that could inform optimizer analysis and training-trajectory studies.

major comments (1)
  1. [Null model (abstract and results)] Null-model construction and validation (abstract and results sections describing the random null model): The claim that observed displacement and localization differences are optimizer-specific rather than architecture- or initialization-driven rests on comparison to the random null model. No explicit description is given of how the null model is generated (e.g., weight randomization procedure, preservation or destruction of layer-wise correlations and activation-induced structure), nor is any validation reported (moment matching, tests on untrained or frozen networks). This is load-bearing for the central claim; without it the optimizer attribution cannot be isolated.
minor comments (2)
  1. [Abstract] Abstract and results: quantitative details such as number of independent runs, error bars on displacement and IPR statistics, dataset sizes, and architecture controls are not reported, making it difficult to assess the robustness of the qualitative differences described.
  2. [Methods] The manuscript would benefit from an explicit statement of how many leading eigenvectors are tracked and whether results are sensitive to this choice.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the importance of the null-model details. We agree this is essential for isolating optimizer effects and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Null-model construction and validation (abstract and results sections describing the random null model): The claim that observed displacement and localization differences are optimizer-specific rather than architecture- or initialization-driven rests on comparison to the random null model. No explicit description is given of how the null model is generated (e.g., weight randomization procedure, preservation or destruction of layer-wise correlations and activation-induced structure), nor is any validation reported (moment matching, tests on untrained or frozen networks). This is load-bearing for the central claim; without it the optimizer attribution cannot be isolated.

    Authors: We agree that the current manuscript provides insufficient detail on the null model. In the revised version we will add a dedicated Methods subsection specifying that the null model is generated by independently and randomly permuting the weights within each layer, which preserves the per-layer marginal distributions and the architecture but destroys learned correlations and activation-induced structure. We will also report validation: (i) moment matching of the leading eigenvalues between the empirical Hessian and the null model on untrained networks, and (ii) explicit checks that displacement and IPR statistics coincide between real and null Hessians before training begins. These additions will make the optimizer attribution explicit and falsifiable.

    Revision: yes
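The within-layer permutation described in this simulated response can be sketched as follows. The description comes from the simulated rebuttal, not from the paper, so treating it as the actual null model is an assumption. The sketch covers only the shuffling step; how a Hessian is then evaluated on the shuffled network is not specified on this page. The function name and the toy layer shapes are hypothetical.

```python
import numpy as np

def permute_within_layers(layers, seed=0):
    """Shuffle the entries of each weight array independently.

    Each layer keeps its shape and its multiset of values (the per-layer
    marginal distribution), but the arrangement of values is randomized.
    """
    rng = np.random.default_rng(seed)
    shuffled = []
    for w in layers:
        flat = rng.permutation(w.ravel())  # returns a shuffled copy
        shuffled.append(flat.reshape(w.shape))
    return shuffled

# Toy two-layer MLP weights (made-up shapes and values).
rng = np.random.default_rng(1)
layers = [rng.standard_normal((4, 3)), rng.standard_normal((3, 2))]
null_layers = permute_within_layers(layers)
```

Since values never cross layer boundaries, any statistic that depends only on per-layer weight histograms is unchanged, which is the property the response relies on.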

Circularity Check

0 steps flagged

No circularity: empirical measurements of Hessian eigenvector statistics against architecture null model

full rationale

The paper reports direct empirical measurements of eigenvector displacement (inspired by glassy systems) and inverse participation ratio on MLPs trained with SGD vs Adam, compared against a random null model induced by the architecture. No equations, fitted parameters, or self-citations are shown to reduce the reported statistics to inputs by construction; the central claims rest on observed differences in these quantities rather than any self-referential derivation or renamed known result. The analysis is self-contained as observational data against an external baseline.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard definitions of the Hessian and the inverse participation ratio; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption: The Hessian matrix of the loss captures the local curvature relevant to optimization dynamics.
    Invoked implicitly when linking eigenvalues/eigenvectors to sharpness and training trajectories.
  • standard math: The inverse participation ratio is a valid scalar measure of eigenvector localization.
    Used without derivation as a standard tool from physics.

pith-pipeline@v0.9.1-grok · 5708 in / 1224 out tokens · 22453 ms · 2026-06-30T07:51:28.790727+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 26 canonical work pages · 8 internal anchors

  1. [1]

    SAM operates far from home: eigenvalue regularization as a dynamical phenomenon

    Atish Agarwala and Yann Dauphin. SAM operates far from home: eigenvalue regularization as a dynamical phenomenon. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023. URL https://dl.acm.org/doi/10.5555/3618408.3618416

  2. [2]

    Negative eigenvalues of the Hessian in deep neural networks

    Guillaume Alain, Nicolas Le Roux, and Pierre-Antoine Manzagol. Negative eigenvalues of the Hessian in deep neural networks, February 2019. URL http://arxiv.org/abs/1902.02366

  3. [3]

    Eigenvector dynamics: General theory and some applications

    Romain Allez and Jean-Philippe Bouchaud. Eigenvector dynamics: General theory and some applications. Physical Review E, 86(4):046202, October 2012. ISSN 1539-3755, 1550-2376. URL https://link.aps.org/doi/10.1103/PhysRevE.86.046202

  4. [4]

    Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD

    Arseniy Andreyev and Pierfrancesco Beneventano. Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD, December 2025. URL http://arxiv.org/abs/2412.20553

  5. [5]

    High-dimensional SGD aligns with emerging outlier eigenspaces

    Gerard Ben Arous, Reza Gheissari, Jiaoyang Huang, and Aukosh Jagannath. High-dimensional SGD aligns with emerging outlier eigenspaces. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=MHjigVnI04

  6. [6]

    Comparing Dynamics: Deep Neural Networks versus Glassy Systems

    Marco Baity-Jesi, Levent Sagun, Mario Geiger, Stefano Spigler, Gerard Ben Arous, Chiara Cammarota, Yann LeCun, Matthieu Wyart, and Giulio Biroli. Comparing Dynamics: Deep Neural Networks versus Glassy Systems. In Proceedings of the 35th International Conference on Machine Learning, pages 314–323. PMLR, July 2018. URL https://proceedings.mlr.press/v80/baity...

  7. [7]

    Hessian inertia in neural networks

    Xuchan Bao, Alberto Bietti, Aaron Defazio, and Vivien Cabannes. Hessian inertia in neural networks. In Proceedings of the 1st Workshop on High-dimensional Learning Dynamics (HiLD), International Conference on Machine Learning (ICML), 2023. Poster presentation.

  8. [8]

    Universal characteristics of deep neural network loss surfaces from random matrix theory

    Nicholas P Baskerville, Jonathan P Keating, Francesco Mezzadri, Joseph Najnudel, and Diego Granziol. Universal characteristics of deep neural network loss surfaces from random matrix theory. Journal of Physics A: Mathematical and Theoretical, 55(49):494002, December 2022. ISSN 1751-8113, 1751-8121. doi: 10.1088/1751-8121/aca7f5. URL https://iopscience.io...

  9. [9]

    The Role of the time-Dependent Hessian in High-Dimensional Optimization

    Tony Bonnaire, Giulio Biroli, and Chiara Cammarota. The Role of the time-Dependent Hessian in High-Dimensional Optimization. Journal of Statistical Mechanics: Theory and Experiment, 2025(8):083401, 2025. URL https://arxiv.org/abs/2403.02418

  10. [10]

    Sharp Minima Can Generalize For Deep Nets

    Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp Minima Can Generalize for Deep Nets. In International Conference on Machine Learning, pages 1019–1028. PMLR, 2017. URL https://arxiv.org/abs/1703.04933

  11. [11]

    HAWQ: Hessian AWare Quantization of neural networks with mixed-precision

    Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. HAWQ: Hessian AWare Quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF international conference on computer vision, pages 293–302, 2019. URL https://arxiv.org/abs/1905.03696

  12. [12]

    Sharpness-Aware Minimization for Efficiently Improving Generalization

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-Aware Minimization for Efficiently Improving Generalization, April 2021. URL http://arxiv.org/abs/2010.01412

  13. [13]

    Emergent properties of the local geometry of neural loss landscapes

    Stanislav Fort and Surya Ganguli. Emergent properties of the local geometry of neural loss landscapes, October 2019. URL http://arxiv.org/abs/1910.05929

  14. [14]

    An investigation into neural net optimization via hessian eigenvalue density

    Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density, 2019. URL https://arxiv.org/abs/1901.10159

  15. [15]

    An investigation into Neural Net Optimization via Hessian Eigenvalue Density

    Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into Neural Net Optimization via Hessian Eigenvalue Density. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2232–2241. PMLR, 09–15 Jun 2019. URL ht...

  16. [16]

    Understanding the difficulty of training deep feedforward neural networks

    Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, March

  17. [17]

    URL https://proceedings.mlr.press/v9/glorot10a.html

  18. [18]

    Beyond Random Matrix Theory for Deep Networks

    Diego Granziol. Beyond Random Matrix Theory for Deep Networks, November 2021. URL http://arxiv.org/abs/2006.07721

  19. [19]

    Gradient Descent Happens in a Tiny Subspace

    Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient Descent Happens in a Tiny Subspace, December 2018. URL http://arxiv.org/abs/1812.04754

  20. [20]

    The bayesian evidence scheme for regularizing probability-density estimating neural networks

    Dirk Husmeier. The bayesian evidence scheme for regularizing probability-density estimating neural networks. Neural computation, 12(11):2685–2717, 2000. URL https://pubmed.ncbi.nlm.nih.gov/11110132/.

  21. [21]

    On the relation between the sharpest directions of DNN loss and the SGD step length

    Stanislaw Jastrzebski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. On the relation between the Sharpest Directions of DNN loss and the SGD Step Length, 2019. URL https://arxiv.org/abs/1807.05031

  22. [22]

    Fantastic generalization measures and where to find them

    Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning Representations (ICLR), 2020. URL https://arxiv.org/abs/1912.02178

  23. [23]

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima, 2017. URL https://arxiv.org/abs/1609.04836

  24. [24]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980

  25. [25]

    An iteration method for the solution of the eigenvalue problem of linear differential and integral operators

    Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of research of the National Bureau of Standards, 45(4):255–282, 1950

  26. [26]

    Gradient-based learning applied to document recognition

    Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791

  27. [27]

    Generalized inverse participation ratio as a possible measure of localization for interacting systems

    N. C. Murphy, R. Wortis, and W. A. Atkinson. Generalized inverse participation ratio as a possible measure of localization for interacting systems. Phys. Rev. B, 83:184206, May 2011. doi: 10.1103/PhysRevB.83.184206. URL https://link.aps.org/doi/10.1103/PhysRevB.83.184206

  28. [28]

    Spectral signatures of learning: Uncovering the localization phase transition in deep neural networks via random matrix theory

    Quang Nguyen and Ngoc-Kim-Khanh Nguyen. Spectral signatures of learning: Uncovering the localization phase transition in deep neural networks via random matrix theory. Physica A: Statistical Mechanics and its Applications, 692:131474, 2026. ISSN 0378-4371. doi: https://doi.org/10.1016/j.physa.2026.131474. URL https://www.sciencedirect.com/science/article/...

  29. [29]

    Eigenvectors of random matrices: A survey

    Sean O’Rourke, Van Vu, and Ke Wang. Eigenvectors of random matrices: A survey. Journal of Combinatorial Theory, Series A, 144:361–442, 2016. ISSN 0097-3165. doi: https://doi.org/10.1016/j.jcta.2016.06.008. URL https://www.sciencedirect.com/science/article/pii/S0097316516300383. Fifty Years of the Journal of Combinatorial Theory

  30. [30]

    Fast exact multiplication by the hessian

    Barak A. Pearlmutter. Fast exact multiplication by the hessian. Neural Computation, 6(1):147–160, 01 1994. ISSN 0899-7667. doi: 10.1162/neco.1994.6.1.147. URL https://doi.org/10.1162/neco.1994.6.1.147

  31. [31]

    Delocalization of eigenvectors of random matrices with independent entries

    Mark Rudelson and Roman Vershynin. Delocalization of eigenvectors of random matrices with independent entries. Duke Mathematical Journal, 164(13), October 2015. ISSN 0012-7094. doi: 10.1215/00127094-3129809. URL http://dx.doi.org/10.1215/00127094-3129809

  32. [32]

    Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

    Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond, 2017. URL https://arxiv.org/abs/1611.07476.

  33. [33]

    Does SGD really happen in tiny subspaces?

    Minhak Song, Kwangjun Ahn, and Chulhee Yun. Does SGD really happen in tiny subspaces?, March 2025. URL http://arxiv.org/abs/2405.16002

  34. [34]

    EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis

    Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis. In Proceedings of the 36th International Conference on Machine Learning, pages 6566–6575. PMLR, May 2019. URL https://proceedings.mlr.press/v97/wang19g.html

  35. [35]

    Training instabilities favor flatter solutions in gradient descent

    Lawrence Wang and Stephen J. Roberts. Training instabilities favor flatter solutions in gradient descent. Neural Networks, 201:108874, 2026. ISSN 0893-6080. URL https://www.sciencedirect.com/science/article/pii/S0893608026003357

  36. [36]

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017. URL https://arxiv.org/abs/1708.07747

  37. [37]

    Hessian-based analysis of large batch training and robustness to adversaries

    Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. Advances in Neural Information Processing Systems, 31, 2018. URL https://arxiv.org/abs/1802.08241.
