pith. machine review for the scientific record.

arxiv: 2604.10202 · v2 · submitted 2026-04-11 · 💻 cs.LG · cs.AI · cs.NE

Recognition: unknown

Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE
keywords Hessian eigenspectrum · cross-entropy loss · neural network sharpness · Wolkowicz-Styan bound · smooth activations · multilayer networks · loss geometry · generalization

The pith

A closed-form upper bound exists on the largest Hessian eigenvalue of cross-entropy loss in smooth nonlinear multilayer neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a closed-form upper bound on the maximum eigenvalue of the Hessian for the cross-entropy loss in multilayer neural networks with smooth nonlinear activations. The bound is expressed explicitly in terms of the affine transformation parameters, hidden layer dimensions, and the degree of orthogonality among training samples. It is obtained by applying the Wolkowicz-Styan bound to the Hessian matrix, which avoids solving the characteristic equation numerically. A sympathetic reader would care because this analytical expression allows assessment of loss sharpness, linked to generalization, in general smooth architectures where prior closed-form results were unavailable.

Core claim

The authors demonstrate that the Wolkowicz-Styan bound yields a closed-form upper bound on the largest eigenvalue of the Hessian of the cross-entropy loss for nonlinear smooth multilayer neural networks. This bound is a function of the affine transformation parameters, the hidden layer dimensions, and the degree of orthogonality among the training samples. The result supplies an analytical characterization of loss sharpness at critical points without explicit numerical computation of the eigenspectrum.

What carries the argument

The Wolkowicz-Styan bound, an inequality that upper-bounds the largest eigenvalue of a symmetric matrix using only its trace and the trace of its square, applied directly to the Hessian of the cross-entropy loss.
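For readers unfamiliar with the inequality, a minimal numerical sketch (illustrative, not taken from the paper): for a real symmetric n×n matrix A with mean eigenvalue m = tr(A)/n and eigenvalue variance s² = tr(A²)/n − m², Wolkowicz and Styan showed that λ_max ≤ m + s√(n−1). The NumPy check below verifies this on a random symmetric matrix.

```python
import numpy as np

def ws_upper_bound(A):
    """Wolkowicz-Styan upper bound on the largest eigenvalue of a real
    symmetric matrix, from traces alone: lambda_max <= m + s*sqrt(n-1),
    where m = tr(A)/n and s^2 = tr(A^2)/n - m^2."""
    n = A.shape[0]
    m = np.trace(A) / n
    s2 = np.trace(A @ A) / n - m ** 2
    return m + np.sqrt(max(s2, 0.0) * (n - 1))

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = (B + B.T) / 2.0                      # random symmetric test matrix

lam_max = np.linalg.eigvalsh(A)[-1]      # exact largest eigenvalue
print(f"lambda_max = {lam_max:.4f}, WS bound = {ws_upper_bound(A):.4f}")
assert lam_max <= ws_upper_bound(A) + 1e-9
```

The appeal for a Hessian H is that tr(H) and tr(H²) = ‖H‖²_F can sometimes be written in closed form even when the eigenvalues themselves cannot.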

Load-bearing premise

The Wolkowicz-Styan bound applies directly to the Hessian of the cross-entropy loss under the stated smoothness and multilayer architecture assumptions.

What would settle it

Numerically compute the true maximum eigenvalue of the Hessian for a small smooth nonlinear network such as a two-layer sigmoid model trained on a few samples, then check whether this value is always at most the closed-form bound given by the paper for the same parameters and orthogonality measure.
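A hedged sketch of that experiment, assuming a hypothetical two-layer sigmoid network and a brute-force finite-difference Hessian; the paper's closed-form expression is not reproduced in this review, so the final comparison is left as a placeholder.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical tiny instance: 3 samples, 2 features, one hidden layer.
X = np.array([[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]])
y = np.array([1.0, 0.0, 1.0])
H_DIM = 4

def loss(theta):
    """Binary cross-entropy of a two-layer sigmoid net, parameters flat."""
    i = 0
    W1 = theta[i:i + 2 * H_DIM].reshape(2, H_DIM); i += 2 * H_DIM
    b1 = theta[i:i + H_DIM]; i += H_DIM
    w2 = theta[i:i + H_DIM]; i += H_DIM
    b2 = theta[i]
    p = sigmoid(sigmoid(X @ W1 + b1) @ w2 + b2)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def numerical_hessian(f, theta, eps=1e-4):
    """Dense symmetric Hessian via central second differences."""
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            t = [theta.copy() for _ in range(4)]
            t[0][i] += eps; t[0][j] += eps
            t[1][i] += eps; t[1][j] -= eps
            t[2][i] -= eps; t[2][j] += eps
            t[3][i] -= eps; t[3][j] -= eps
            H[i, j] = H[j, i] = (f(t[0]) - f(t[1]) - f(t[2]) + f(t[3])) / (4 * eps ** 2)
    return H

rng = np.random.default_rng(1)
theta = 0.1 * rng.standard_normal(2 * H_DIM + H_DIM + H_DIM + 1)  # 17 params

lam_max = np.linalg.eigvalsh(numerical_hessian(loss, theta))[-1]
print(f"numerical lambda_max = {lam_max:.6f}")
# Settling test: evaluate the paper's closed-form bound at the same
# parameters and orthogonality measure, then check lam_max <= that bound.
```

Repeating this over many random parameter draws, and near convergence, is what would make the check convincing rather than anecdotal.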

Figures

Figures reproduced from arXiv: 2604.10202 by Hirotaka Takahashi, Kazuki Sakai, Makoto Sasaki, Yohei Kakimoto, Yusuke Sakai, Yuto Omae.

Figure 1. The NN architecture analyzed in this study. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2. Relationship between the maximum eigenvalues at [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3. (Top) Decision boundaries in the input space for [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 5. Dynamics of the loss L and the upper [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 8. Hessian (Matrix size [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 9. Comparison of the trace between numerical solutions [PITH_FULL_IMAGE:figures/full_fig_p006_9.png] view at source ↗
Figure 10. Relationship between the Frobenius norm of the [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 12. Relationship between the Frobenius norm of the [PITH_FULL_IMAGE:figures/full_fig_p009_12.png] view at source ↗
Figure 13. Activation function f(y) and its first and second derivatives. view at source ↗
read the original abstract

Neural networks (NNs) are central to modern machine learning and achieve state-of-the-art results in many applications. However, the relationship between loss geometry and generalization is still not well understood. The local geometry of the loss function near a critical point is well-approximated by its quadratic form, obtained through a second-order Taylor expansion. The coefficients of the quadratic term correspond to the Hessian matrix, whose eigenspectrum allows us to evaluate the sharpness of the loss at the critical point. Extensive research suggests flat critical points generalize better, while sharp ones lead to higher generalization error. However, evaluating sharpness requires the Hessian eigenspectrum, and the characteristic equation of a general matrix has no closed-form solution. Therefore, most existing studies on evaluating loss sharpness rely on numerical approximation methods. Existing closed-form analyses of the eigenspectrum are primarily limited to simplified architectures, such as linear or ReLU-activated networks; consequently, theoretical analysis of smooth nonlinear multilayer neural networks remains limited. Against this background, this study focuses on nonlinear, smooth multilayer neural networks and derives a closed-form upper bound for the maximum eigenvalue of the Hessian with respect to the cross-entropy loss by leveraging the Wolkowicz-Styan bound. Specifically, the derived upper bound is expressed as a function of the affine transformation parameters, hidden layer dimensions, and the degree of orthogonality among the training samples. The primary contribution of this paper is an analytical characterization of loss sharpness in smooth nonlinear multilayer neural networks via a closed-form expression, avoiding explicit numerical eigenspectrum computation. We hope that this work provides a small yet meaningful step toward unraveling the mysteries of deep learning.
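The quadratic approximation the abstract refers to, at a critical point θ* where the gradient vanishes, is the standard second-order Taylor expansion:

```latex
L(\theta) \approx L(\theta^{*})
  + \tfrac{1}{2}\,(\theta - \theta^{*})^{\top} H \,(\theta - \theta^{*}),
\qquad H = \nabla^{2} L(\theta^{*}),
```

so the curvature along a unit direction v is vᵀHv, and the sharpest direction attains the largest Hessian eigenvalue; this is the quantity the paper's bound controls.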

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives a closed-form upper bound on the largest eigenvalue of the Hessian of the cross-entropy loss for smooth nonlinear multilayer neural networks by applying the Wolkowicz-Styan matrix inequality. The resulting expression depends on the affine transformation parameters, hidden-layer dimensions, and a measure of orthogonality among the training samples. The central claim is that this provides an analytical characterization of loss sharpness that avoids numerical eigenspectrum computation, extending beyond the linear or ReLU cases treated in prior work.

Significance. If the derivation is complete and the bound holds under the stated assumptions, the result would be a modest but useful contribution to the theoretical analysis of loss landscapes. Closed-form bounds on Hessian eigenvalues for general smooth activations are rare, and an explicit dependence on architecture and data orthogonality could support future work on sharpness and generalization. The approach of invoking a known trace/Frobenius-based inequality is appropriate in principle, though its practical value rests on whether the multilayer chain-rule structure can be controlled without extra assumptions.

major comments (2)
  1. [§3, main derivation] The application of the Wolkowicz-Styan bound to the full Hessian requires explicit intermediate steps showing that all cross-layer terms arising from the chain rule (Jacobians of activations and second-derivative blocks) are dominated by a matrix whose spectrum depends only on the claimed quantities. No such domination argument is supplied, leaving open whether the bound remains free of dependence on activation derivatives evaluated at the critical point. (The generic decomposition at issue is sketched after these comments.)
  2. [Assumptions paragraph preceding Eq. (bound)] The smoothness and multilayer architecture assumptions are listed, but it is not shown that the Wolkowicz-Styan inequality applies directly to the composite Hessian without additional restrictions (e.g., bounded activation Hessians or a neighborhood around the critical point). This is load-bearing for the closed-form claim.
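For orientation, the generic identity behind major comment 1, stated in standard notation that is not necessarily the paper's: for a composite loss L(θ) = ℓ(f(θ)), the Hessian splits into a Gauss-Newton term and a term weighted by second derivatives of the network map,

```latex
\nabla^{2}_{\theta} L
  = J_{f}(\theta)^{\top}\, \nabla^{2}\ell\bigl(f(\theta)\bigr)\, J_{f}(\theta)
  \;+\; \sum_{k} \frac{\partial \ell}{\partial f_{k}}\, \nabla^{2}_{\theta} f_{k}(\theta),
```

and it is the second sum, which carries the activation second derivatives, that the referee wants shown to be dominated by a matrix depending only on the claimed quantities.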
minor comments (2)
  1. [Notation section] The precise definition of the 'degree of orthogonality' among samples should be given as an explicit equation or norm rather than left as a descriptive phrase; one illustrative candidate is sketched after this list.
  2. [Introduction] A brief comparison table or remark contrasting the new bound with existing closed-form results for linear networks would improve context.
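On minor comment 1, the paper's exact definition is not reproduced in this review; as an illustration of the kind of explicit formula the referee is requesting, one plausible (hypothetical) candidate measures how far the samples' cosine-similarity Gram matrix is from the identity.

```python
import numpy as np

def orthogonality_deviation(X):
    """Return ||G - I||_F, where G[i, j] is the cosine similarity of
    samples i and j. Zero iff the samples are pairwise orthogonal.
    (Hypothetical measure; the paper's definition may differ.)"""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = Xn @ Xn.T
    return np.linalg.norm(G - np.eye(X.shape[0]))

X = np.array([[1.0, 0.0], [0.0, 2.0]])   # exactly orthogonal samples
print(orthogonality_deviation(X))        # 0.0
```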

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the major comments point by point below, indicating where revisions will be made to improve clarity and completeness.

read point-by-point responses
  1. Referee: [§3, main derivation] The application of the Wolkowicz-Styan bound to the full Hessian requires explicit intermediate steps showing that all cross-layer terms arising from the chain rule (Jacobians of activations and second-derivative blocks) are dominated by a matrix whose spectrum depends only on the claimed quantities. No such domination argument is supplied, leaving open whether the bound remains free of dependence on activation derivatives evaluated at the critical point.

    Authors: We agree that the original derivation in Section 3 would be strengthened by additional explicit steps. In the revised manuscript we will insert a detailed derivation of the chain-rule expansion of the Hessian, followed by term-by-term application of the smoothness assumptions to bound the cross-layer Jacobians and second-derivative blocks. Each bound will be shown to be dominated by a matrix whose eigenvalues depend only on the affine parameters, hidden-layer dimensions, and the training-sample orthogonality measure, thereby removing any residual dependence on the pointwise values of the activation derivatives. revision: yes

  2. Referee: [Assumptions paragraph preceding Eq. (bound)] The smoothness and multilayer architecture assumptions are listed, but it is not shown that the Wolkowicz-Styan inequality applies directly to the composite Hessian without additional restrictions (e.g., bounded activation Hessians or a neighborhood around the critical point). This is load-bearing for the closed-form claim.

    Authors: The referee correctly identifies that the direct applicability of the Wolkowicz-Styan inequality to the composite Hessian needs explicit justification. We will revise the assumptions paragraph to include a short lemma establishing that, under the stated smoothness and multilayer architecture conditions, the Hessian admits a decomposition to which the inequality applies globally; no auxiliary bounds on activation Hessians or localization to a neighborhood are required beyond the smoothness already assumed. revision: yes

Circularity Check

0 steps flagged

No circularity: external Wolkowicz-Styan bound applied to explicitly constructed Hessian

full rationale

The paper constructs the Hessian of cross-entropy loss via the chain rule for a smooth nonlinear multilayer network, then applies the known Wolkowicz-Styan matrix inequality (an external result from 1980) to bound its largest eigenvalue. The resulting closed-form expression depends on affine parameters, layer dimensions, and sample orthogonality because those quantities appear in the explicit Hessian blocks; this dependence is not smuggled in by definition or by renaming a fitted quantity. No self-citation is load-bearing for the central step, no ansatz is adopted via prior work of the same authors, and no prediction reduces to a fitted input by construction. The derivation remains self-contained against the external bound and the standard chain-rule expansion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the contribution rests on application of the pre-existing Wolkowicz-Styan bound to the Hessian of cross-entropy loss.

pith-pipeline@v0.9.0 · 5623 in / 1101 out tokens · 44053 ms · 2026-05-10T16:09:01.883247+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 34 canonical work pages · 2 internal anchors

  1. [1]

    J. Chai, H. Zeng, A. Li, E. W. T. Ngai, Deep learning in computer vision: A critical review of emerging techniques and application scenarios, Machine Learning with Applications 6 (2021) 100134. doi:10.1016/j.mlwa.2021.100134

  2. [2]

    E. O. Arkhangelskaya, S. I. Nikolenko, Deep Learning for Natural Language Processing: A Survey, Journal of Mathematical Sciences 273 (4) (2023) 533–582. doi:10.1007/s10958-023-06519-6

  3. [3]

    A. Mehrish, N. Majumder, R. Bharadwaj, R. Mihalcea, S. Poria, A review of deep learning techniques for speech processing, Information Fusion 99 (2023) 101869. doi:10.1016/j.inffus.2023.101869

  4. [4]

    L. Dinh, R. Pascanu, S. Bengio, Y. Bengio, Sharp Minima Can Generalize For Deep Nets, Proceedings of the 34th International Conference on Machine Learning (2017). arXiv:1703.04933

  5. [5]

    X. Yue, M. Nouiehed, R. A. Kontar, SALR: Sharpness-aware Learning Rate Scheduler for Improved Generalization, IEEE Transactions on Neural Networks and Learning Systems 35 (9) (2024) 12518–12527. arXiv:2011.05348, doi:10.1109/TNNLS.2023.3263393

  6. [6]

    S. Hochreiter, J. Schmidhuber, Flat minima, Neural Computation 9 (1) (1997) 1–42. doi:10.1162/neco.1997.9.1.1

  7. [7]

    Y. Liu, S. Yu, T. Lin, Hessian regularization of deep neural networks: A novel approach based on stochastic estimators of Hessian trace, Neurocomputing 536 (2023) 13–20. doi:10.1016/j.neucom.2023.03.017

  8. [8]

    S. Arora, Z. Li, A. Panigrahi, Understanding Gradient Descent on Edge of Stability in Deep Learning (2022). arXiv:2205.09745, doi:10.48550/arXiv.2205.09745

  9. [9]

    K. Lyu, Z. Li, S. Arora, Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction (2023). arXiv:2206.07085, doi:10.48550/arXiv.2206.07085

  10. [10]

    C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, Journal of Research of the National Bureau of Standards 45 (4) (1950) 255–282

  11. [11]

    M. Hutchinson, A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines, Communications in Statistics - Simulation and Computation 18 (3) (1989) 1059–1076. doi:10.1080/03610918908812806

  12. [12]

    Z. Yao, A. Gholami, K. Keutzer, M. W. Mahoney, PyHessian: Neural Networks Through the Lens of the Hessian, 2020 IEEE International Conference on Big Data (Big Data) (2020) 581–590. doi:10.1109/BigData50022.2020.9378171

  13. [13]

    B. Ghorbani, S. Krishnan, Y. Xiao, An Investigation into Neural Net Optimization via Hessian Eigenvalue Density, Proceedings of the 36th International Conference on Machine Learning (2019) 2232–2241

  14. [14]

    Z. Dong, Z. Yao, D. Arfeen, A. Gholami, M. W. Mahoney, K. Keutzer, HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks, Advances in Neural Information Processing Systems 33 (2020) 18518–18529

  15. [15]

    S. P. Singh, W. Ormaniec, T. Hofmann, Cracking the Hessian: Closed-Form Hessian Spectra for Fundamental Neural Networks, OpenReview, ICLR 2026 (2026)

  16. [16]

    H. Wolkowicz, G. P. H. Styan, Bounds for eigenvalues using traces, Linear Algebra and its Applications 29 (1980) 471–506. doi:10.1016/0024-3795(80)90258-X

  17. [17]

    J. K. Merikoski, A. Virtanen, Bounds for eigenvalues using the trace and determinant, Linear Algebra and its Applications 264 (1997) 101–108. doi:10.1016/S0024-3795(97)00067-0

  18. [18]

    J. K. Merikoski, A. Virtanen, Best possible bounds for ordered positive numbers using their sum and product, Mathematical Inequalities & Applications 4 (1) (2001) 67–84

  19. [19]

    R. Sharma, R. Kumar, R. Saini, Note on Bounds for Eigenvalues using Traces (2014). arXiv:1409.0096, doi:10.48550/arXiv.1409.0096

  20. [20]

    A. A. Minai, R. D. Williams, On the derivatives of the sigmoid, Neural Networks 6 (6) (1993) 845–853. doi:10.1016/S0893-6080(05)80129-7

  21. [21]

    I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press (2016)

  22. [22]

    F. Berzal, DL101 Neural Network Outputs and Loss Functions (2025). arXiv:2511.05131, doi:10.48550/arXiv.2511.05131

  23. [23]

    D. Hendrycks, K. Gimpel, Gaussian Error Linear Units (GELUs) (2016). doi:10.48550/arXiv.1606.08415

  24. [24]

    H. Li, Z. Xu, G. Taylor, C. Studer, T. Goldstein, Visualizing the Loss Landscape of Neural Nets, Advances in Neural Information Processing Systems 31 (2018)

  25. [25]

    M. Wei, D. J. Schwab, How noise affects the Hessian spectrum in overparameterized neural networks (2019). arXiv:1910.00195, doi:10.48550/arXiv.1910.00195

  26. [26]

    L. Wu, W. J. Su, The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent (2023). arXiv:2305.17490, doi:10.48550/arXiv.2305.17490

  27. [27]

    P. Marion, L. Chizat, Deep linear networks for regression are implicitly regularized towards flat minima (2024). arXiv:2405.13456, doi:10.48550/arXiv.2405.13456

  28. [28]

    A. Damian, T. Ma, J. D. Lee, Label Noise SGD Provably Prefers Flat Global Minimizers, in: Advances in Neural Information Processing Systems, Vol. 34, Curran Associates, Inc., 2021, pp. 27449–27461

  29. [29]

    H. Liu, S. M. Xie, Z. Li, T. Ma, Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models, Proceedings of the 40th International Conference on Machine Learning (2023) 22188–22214

  30. [30]

    A. R. Sankar, Y. Khasbage, R. Vigneswaran, V. N. Balasubramanian, A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization, Proceedings of the AAAI Conference on Artificial Intelligence 35 (11) (2021) 9481–9488. doi:10.1609/aaai.v35i11.17142

  31. [31]

    M. Bolshim, A. Kugaevskikh, Local properties of neural networks through the lens of layer-wise Hessians (2025). arXiv:2510.17486, doi:10.48550/arXiv.2510.17486

  32. [32]

    H. R. Zhang, D. Li, H. Ju, Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach (2024). arXiv:2306.08553, doi:10.48550/arXiv.2306.08553

  33. [33]

    Y. Zhou, Y. Li, L. Feng, S.-J. Huang, Improving generalization of deep neural networks by optimum shifting, Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence 39 (2025) 10...

  34. [34]

    H. Luo, T. Truong, T. Pham, M. Harandi, D. Phung, T. Le, Explicit Eigenvalue Regularization Improves Sharpness-Aware Minimization (2025). arXiv:2501.12666, doi:10.48550/arXiv.2501.12666

  35. [35]

    C. Bishop, Exact Calculation of the Hessian Matrix for the Multilayer Perceptron, Neural Computation 4 (4) (1992) 494–501. doi:10.1162/neco.1992.4.4.494

  36. [36]

    P. Foret, A. Kleiner, H. Mobahi, B. Neyshabur, Sharpness-Aware Minimization for Efficiently Improving Generalization (2021). arXiv:2010.01412, doi:10.48550/arXiv.2010.01412

  37. [37]

    J. Kwon, J. Kim, H. Park, I. K. Choi, ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks, Proceedings of the 38th International Conference on Machine Learning (2021) 5905–5914

  38. [38]

    Y. Zhou, Y. Qu, X. Xu, H. Shen, ImbSAM: A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition, 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023) 11311–11321. doi:10.1109/ICCV51070.2023.01042

  39. [39]

    M. Andriushchenko, N. Flammarion, Towards Understanding Sharpness-Aware Minimization, Proceedings of the 39th International Conference on Machine Learning (2022) 639–668

  40. [40]

    X. Chen, C.-J. Hsieh, B. Gong, When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations (2022). arXiv:2106.01548, doi:10.48550/arXiv.2106.01548

  41. [41]

    W. Huang, X. Liu, X. Wang, J. Yamagishi, Y. Qian, From Sharpness to Better Generalization for Speech Deepfake Detection (2025). arXiv:2506.11532, doi:10.48550/arXiv.2506.11532

  42. [42]

    S. P. Singh, G. Bachmann, T. Hofmann, Analytic Insights into Structure and Rank of Neural Network Hessian Maps (2021). arXiv:2106.16225, doi:10.48550/arXiv.2106.16225

  43. [43]

    Suryadi, L. Y. Chew, Y.-S. Ong, Jacobian Granger causality for count and binary data with applications to causal network inference, Scientific Reports 16 (1) (2025) 3452. doi:10.1038/s41598-025-33385-w

  44. [44]

    J. S. Tyler, The Laguerre–Samuelson Inequality with Extensions and Applications in Statistics and Matrix Theory, Department of Mathematics and Statistics, McGill University (1999)

  45. [45]

    P. A. Samuelson, How Deviant Can You Be?, Journal of the American Statistical Association 63 (324) (1968) 1522–1525. doi:10.1080/01621459.1968.10480944

  46. [46]

    Y. Wu, X. Zhu, C. Wu, A. Wang, R. Ge, Dissecting Hessian: Understanding Common Structure of Hessian in Neural Networks (2022). arXiv:2010.04261, doi:10.48550/arXiv.2010.04261

  47. [47]

    L. Sagun, L. Bottou, Y. LeCun, Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond (2017). arXiv:1611.07476, doi:10.48550/arXiv.1611.07476

  48. [48]

    Z. Xie, Q.-Y. Tang, Y. Cai, M. Sun, P. Li, On the Power-Law Hessian Spectrums in Deep Learning (2022). arXiv:2201.13011, doi:10.48550/arXiv.2201.13011

  49. [49]

    V. Papyan, Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra, Journal of Machine Learning Research 21 (2020)

  50. [50]

    I. J. Goodfellow, O. Vinyals, A. M. Saxe, Qualitatively characterizing neural network optimization problems (2015). arXiv:1412.6544, doi:10.48550/arXiv.1412.6544

  51. [51]

    M. Abramowitz, I. A. Stegun, Handbook of Mathematical Functions, Dover Publications (1965)

  52. [52]

    K. B. Petersen, M. S. Pedersen, The Matrix Cookbook, Technical University of Denmark (2012)