Stochastic Estimation of the Layer-wise Hessian Trace for Monitoring Neural-network Training

Maxim Bolshim (1), Alexander Kugaevskikh (1) ((1) ITMO University, Saint Petersburg, Russia)

arxiv: 2605.25674 · v1 · pith:2KC4EOC4new · submitted 2026-05-25 · 💻 cs.LG

Pith reviewed 2026-06-29 22:43 UTC · model grok-4.3

classification 💻 cs.LG

keywords stochastic hessian trace · layer-wise curvature · hutchinson estimator · neural network monitoring · weight sharing bias · label memorization detection · mini-batch variance

The pith

A stochastic procedure recovers unbiased per-layer Hessian traces from one backward pass by pairing Hutchinson's estimator with a single full-parameter Hessian-vector product.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to estimate the trace of each diagonal block of the Hessian of a neural network's empirical risk at scales where explicit computation is impossible. Loss and gradient norm separate healthy from pathological training only weakly, while curvature differs qualitatively but remains inaccessible. The procedure combines the Hutchinson stochastic trace estimator with one Hessian-vector product over the full parameter vector to obtain unbiased per-layer traces in a single backward pass. It derives a closed-form variance expression, decomposes total variance under mini-batch sampling to identify a critical probe count, and shows the resulting estimates can detect label memorization via a cumulative-sum rule.

Core claim

The procedure combines the Hutchinson stochastic trace estimator with a single Hessian-vector product over the whole parameter vector and recovers unbiased estimates of every per-layer trace in one backward pass through the computational graph. Correctness under weight sharing requires the layer-wise Hessian to be assembled before the second differentiation: unrolling shared weights into independent coordinates introduces a systematic bias whose sign and magnitude are governed by the cross-instance blocks of the unrolled Hessian. A closed-form expression for the variance of the estimator at a fixed Hessian is derived, together with a decomposition of the total variance under the mini-batch sampling distribution.

What carries the argument

Hutchinson stochastic trace estimator paired with a single Hessian-vector product over the full parameter vector, which recovers per-layer traces in one backward pass.
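A minimal sketch of this pairing in plain Python, on a toy problem. It is not the authors' implementation: the explicit 4×4 matrix `H` and the function `hvp` are hypothetical stand-ins for the Hessian-vector product that an autograd framework would supply without forming the Hessian, and the index sets in `layers` are invented for illustration. One Rademacher probe z is drawn over the whole parameter vector, a single product Hz is computed, and z_l · (Hz)_l is read off for every layer. Averaging over all 2^n probes (feasible only at toy size) checks unbiasedness exactly. The formula in `rademacher_variance` is the standard fixed-matrix result for Rademacher probes; the paper's own closed form may be stated differently.

```python
from itertools import product


def hvp(H, z):
    """Hypothetical stand-in for an autograd Hessian-vector product: returns H @ z."""
    return [sum(h_ij * z_j for h_ij, z_j in zip(row, z)) for row in H]


def layer_trace_samples(H, layers, z):
    """One probe: per-layer values z_l . (H z)_l obtained from a single product H z."""
    Hz = hvp(H, z)
    return [sum(z[i] * Hz[i] for i in idx) for idx in layers]


def exhaustive_mean_and_variance(H, layers):
    """Exact mean and variance over all 2^n Rademacher probes (toy sizes only)."""
    probes = list(product((-1.0, 1.0), repeat=len(H)))
    samples = [layer_trace_samples(H, layers, z) for z in probes]
    n_layers = len(layers)
    means = [sum(s[l] for s in samples) / len(samples) for l in range(n_layers)]
    variances = [
        sum((s[l] - means[l]) ** 2 for s in samples) / len(samples)
        for l in range(n_layers)
    ]
    return means, variances


def rademacher_variance(H, idx):
    """Variance of z_l . (H z)_l for symmetric H and Rademacher z at fixed H."""
    inside = set(idx)
    # Off-diagonal entries inside the layer block count twice (H_ij and H_ji).
    within = 2.0 * sum(H[i][j] ** 2 for i in idx for j in idx if i != j)
    # Entries coupling this layer to other layers enter once.
    across = sum(
        H[i][j] ** 2 for i in idx for j in range(len(H)) if j not in inside
    )
    return within + across


# Toy symmetric "Hessian" with two layers of two parameters each (illustrative).
H = [
    [4.0, 1.0, 0.5, 0.0],
    [1.0, 3.0, 0.0, 2.0],
    [0.5, 0.0, 2.0, 1.0],
    [0.0, 2.0, 1.0, 1.0],
]
layers = [[0, 1], [2, 3]]

means, variances = exhaustive_mean_and_variance(H, layers)
predicted = [rademacher_variance(H, idx) for idx in layers]
```

On this toy matrix the exhaustive averages equal the block traces 7 and 3, and the exhaustive variances match the formula, including the contribution of the off-diagonal blocks that enter because the probe spans the full parameter vector.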

If this is right

Per-layer curvature traces become available for online monitoring without forming the full Hessian.
The variance decomposition identifies a critical probe count K* that balances stochastic and sampling variance.
A cumulative-sum decision rule on the estimates detects the label-memorisation regime at empirical power 179/180 with false-alarm rate 16/120 on the tested models.
Weight-sharing architectures require explicit pre-assembly of layer-wise Hessians to keep the estimates unbiased.
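The cumulative-sum decision rule named in the list above can be illustrated with a generic one-sided CUSUM in the style of Page's scheme. This is a sketch under assumptions, not the paper's calibrated rule: the monitored statistic, the direction of the shift, and the values of `reference`, `slack`, and `threshold` are all hypothetical, and the input series are made-up numbers rather than trace estimates from any experiment.

```python
def cusum_alarm(values, reference, slack, threshold):
    """One-sided CUSUM: index of the first alarm, or None if no alarm is raised.

    The statistic accumulates excursions above reference + slack and is
    reset to zero whenever it would become negative.
    """
    s = 0.0
    for t, x in enumerate(values):
        s = max(0.0, s + (x - reference - slack))
        if s > threshold:
            return t
    return None


# Made-up monitoring series: a stable stretch, then an upward drift.
healthy = [1.0, 1.1, 0.9, 1.0, 1.05]
drifting = healthy + [1.5, 1.8, 2.2, 2.6]

no_alarm = cusum_alarm(healthy, reference=1.0, slack=0.2, threshold=1.5)
alarm_at = cusum_alarm(drifting, reference=1.0, slack=0.2, threshold=1.5)
```

Here `no_alarm` is `None` and `alarm_at` is 7, the third drifting sample. In practice the slack and threshold trade detection delay against false alarms, which is presumably what calibration fixes.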

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same one-pass structure could be used to track curvature changes layer by layer and adapt per-layer learning rates during training.
The bias analysis implies that any second-order method applied to CNNs or other weight-sharing models must handle shared parameters before differentiation to remain consistent.
Extending the probe-count analysis to other stochastic second-order quantities such as Hessian diagonals or low-rank approximations would be a direct next step.

Load-bearing premise

The layer-wise Hessian must be assembled before the second differentiation when weights are shared, otherwise unrolling introduces bias from cross-instance blocks.
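The premise can be made concrete with a short derivation. This is a sketch under the assumption that a shared block $w\in\mathbb{R}^{d}$ appears in $m$ instances through a linear tying map; the symbols $E$, $\tilde{L}$, and $\tilde{H}_{ab}$ are notation introduced here, and the paper's own expression may use a different normalization.

```latex
% Shared block w used in m instances; unrolled coordinates \tilde{w} = E w.
\tilde{w} = \bigl(w^{(1)},\dots,w^{(m)}\bigr) = E\,w,
\qquad E = \mathbf{1}_m \otimes I_d,
\qquad L(w) = \tilde{L}(E\,w).

% E is linear, so the chain rule gives the assembled Hessian exactly:
H = \nabla_w^2 L = E^{\top}\tilde{H}\,E = \sum_{a=1}^{m}\sum_{b=1}^{m}\tilde{H}_{ab}.

% Traces with and without assembly:
\operatorname{tr} H
  = \sum_{a}\operatorname{tr}\tilde{H}_{aa}
  + \sum_{a\neq b}\operatorname{tr}\tilde{H}_{ab},
\qquad
\operatorname{tr}\tilde{H} = \sum_{a}\operatorname{tr}\tilde{H}_{aa}.

% Bias of the unrolled trace relative to the assembled one:
\operatorname{tr}\tilde{H} - \operatorname{tr} H
  = -\sum_{a\neq b}\operatorname{tr}\tilde{H}_{ab}.
```

Under these assumptions the discrepancy is carried entirely by the cross-instance blocks, so its sign and size depend on their traces, consistent with the statement above.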

What would settle it

Computing exact layer-wise Hessian traces on a small shared-weight network and comparing them to the stochastic estimates both with and without pre-assembly would show whether the bias appears exactly as predicted by the cross-instance blocks.
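A toy version of such a check, in plain Python. Everything here is illustrative: the 4×4 matrix `H_unrolled` is a made-up Hessian over two unrolled copies of a two-dimensional shared weight, and `assemble` sums the instance blocks, which is what the chain rule gives when the copies are tied by a linear map. Averaging Hutchinson probes over all sign vectors gives the exact expectation at this size, so the comparison isolates the bias from probe noise.

```python
from itertools import product


def exhaustive_hutchinson(A):
    """Exact expectation of z^T A z over all Rademacher probes (toy sizes only)."""
    n = len(A)
    total = 0.0
    for z in product((-1.0, 1.0), repeat=n):
        Az = [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) for row in A]
        total += sum(z_i * az_i for z_i, az_i in zip(z, Az))
    return total / 2 ** n


def assemble(H_tilde, d):
    """Sum all d x d instance blocks: Hessian with respect to the shared weight."""
    copies = len(H_tilde) // d
    return [
        [
            sum(
                H_tilde[a * d + i][b * d + j]
                for a in range(copies)
                for b in range(copies)
            )
            for j in range(d)
        ]
        for i in range(d)
    ]


d = 2  # dimension of the shared weight (illustrative)
# Made-up symmetric Hessian over two unrolled copies of the shared weight.
H_unrolled = [
    [2.0, 0.3, 0.7, 0.1],
    [0.3, 1.0, 0.2, -0.4],
    [0.7, 0.2, 3.0, 0.5],
    [0.1, -0.4, 0.5, 1.5],
]

trace_assembled = exhaustive_hutchinson(assemble(H_unrolled, d))
trace_unrolled = exhaustive_hutchinson(H_unrolled)

# Sum of the traces of the two cross-instance blocks.
cross_trace = sum(
    H_unrolled[a * d + i][b * d + i]
    for a in range(2)
    for b in range(2)
    if a != b
    for i in range(d)
)
bias = trace_unrolled - trace_assembled
```

For this matrix the assembled trace is 8.1, the unrolled trace is 7.5, and the difference equals minus the cross-instance trace of 0.6. A real test would replace the made-up matrix with exact Hessians of a small shared-weight network.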

Original abstract

The loss and the norm of its gradient separate the healthy and the pathological regimes of neural-network training only weakly, whilst the curvature of the empirical risk differs qualitatively between them but is inaccessible explicitly at parameter counts $P\sim 10^{6}-10^{8}$. We present a stochastic estimator of the trace of the diagonal blocks of the Hessian matrix of the empirical risk of a neural network. The procedure combines the Hutchinson stochastic trace estimator with a single Hessian-vector product over the whole parameter vector and recovers unbiased estimates of every per-layer trace in one backward pass through the computational graph. We show that correctness under weight sharing requires the layer-wise Hessian to be assembled before the second differentiation: unrolling shared weights into independent coordinates introduces a systematic bias whose sign and magnitude are governed by the cross-instance blocks of the unrolled Hessian. A closed-form expression for the variance of the estimator at a fixed Hessian is derived, together with a decomposition of the total variance under the mini-batch sampling distribution. This decomposition yields a critical probe count $K^{\star}$ that balances the two sources of randomness and supports the practical recommendation $K\in[5,10]$ in the on-line monitoring regime. The estimator is applied to the detection of the label-memorisation regime of ResNet-18, ResNet-34, and VGG-11 on CIFAR-10 and CIFAR-100, where a calibrated cumulative-sum decision rule attains an empirical detection power of $179/180$ at a false-alarm rate of $16/120$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper delivers a usable stochastic estimator for per-layer Hessian traces that stays unbiased with shared weights if assembled correctly, plus variance analysis and a working memorization detector on small CNNs.

The main thing to know is that this paper gives a stochastic way to estimate the trace of each layer's Hessian block using Hutchinson's method plus one Hessian-vector product over the full parameter vector. It recovers unbiased per-layer values in a single backward pass when the layer-wise assembly happens before the second differentiation. It also derives a closed-form variance at a fixed Hessian and decomposes the total variance into probe noise and mini-batch noise, which supports the recommendation K in [5,10].

What stands out as new is the explicit weight-sharing correction, the variance decomposition that yields the critical probe count, and the calibrated cumulative-sum rule for detecting label memorization. The abstract shows the estimator applied to ResNet-18/34 and VGG-11 on CIFAR-10/100, where it reaches a detection power of 179/180 at a false-alarm rate of 16/120.

The work does well on the math: the unbiasedness condition is stated plainly, the variance formula is given, and the practical recommendation follows directly from the decomposition. The memorization experiment serves as a concrete test case rather than just a toy.

The soft spot is narrow scope. All results stay with CIFAR-10/100 and those three architectures, so scaling behavior on larger models or other tasks is not shown. The detector is calibrated on this setup, which may limit immediate transfer.

This is for researchers who build training monitors or study curvature effects in optimization. A reader who wants an implementable tool for per-layer diagnostics would find the estimator and probe advice worth trying.

It deserves a serious referee because the core estimator is grounded and the application demonstrates utility without obvious circularity or fitting issues.

Referee Report

1 major / 2 minor

Summary. The paper claims to provide a stochastic estimator for the trace of the diagonal blocks of the Hessian of the empirical risk, combining the Hutchinson trace estimator with a single Hessian-vector product over the full parameter vector to recover unbiased per-layer traces in one backward pass. It derives a closed-form variance expression at fixed Hessian together with a decomposition under the mini-batch sampling distribution that identifies a critical probe count K*, recommends K in [5,10] for online monitoring, and applies the estimator to detect label-memorization regimes in ResNet-18/34 and VGG-11 on CIFAR-10/100, attaining empirical detection power 179/180 at false-alarm rate 16/120. The manuscript explicitly states that unbiasedness under weight sharing requires assembling the layer-wise Hessian before the second differentiation.

Significance. If the unbiasedness claim and the closed-form variance hold, the estimator supplies an efficient, single-backward-pass method for tracking layer-wise curvature that separates healthy and pathological training regimes more clearly than loss or gradient norm. The explicit treatment of the weight-sharing ordering condition and the variance decomposition that yields a practical K recommendation are concrete strengths; the high empirical detection power on standard CNNs further supports utility for online monitoring.

major comments (1)

[Abstract] Abstract: the central unbiasedness claim for per-layer traces is conditioned on assembling the layer-wise Hessian before the second differentiation; the manuscript must demonstrate that the computational-graph implementation enforces this ordering (rather than unrolling shared weights into independent coordinates) and quantify the resulting bias term arising from the cross-instance blocks of the unrolled Hessian, as this directly governs correctness wherever weight tying occurs.

minor comments (2)

[Abstract] Abstract: the phrase 'calibrated cumulative-sum decision rule' and the calibration procedure itself should be defined explicitly, including how the threshold is chosen from the reported 16/120 false-alarm rate.
The variance decomposition under mini-batch sampling is used to motivate K in [5,10]; a brief cross-reference to the equation defining K* would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the constructive comment on the unbiasedness claim. We address the point below and will incorporate the requested clarifications in the revised manuscript.

Referee: [Abstract] Abstract: the central unbiasedness claim for per-layer traces is conditioned on assembling the layer-wise Hessian before the second differentiation; the manuscript must demonstrate that the computational-graph implementation enforces this ordering (rather than unrolling shared weights into independent coordinates) and quantify the resulting bias term arising from the cross-instance blocks of the unrolled Hessian, as this directly governs correctness wherever weight tying occurs.

Authors: We agree that the implementation detail merits explicit demonstration. The current manuscript already states the requirement and the source of bias, but does not include a concrete verification that the computational graph respects the ordering. In the revision we will add a short paragraph (likely in Section 3) describing how the PyTorch autograd graph is constructed on the layer-wise parameter tensors as assembled by the model definition; this automatically performs the first differentiation on the assembled blocks before the second differentiation, avoiding independent-coordinate unrolling. We will also insert an explicit expression for the bias term (the sum of the cross-instance Hessian blocks scaled by the appropriate factors from the Hutchinson estimator) to quantify its contribution under weight tying. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The derivation relies on the standard Hutchinson stochastic trace estimator combined with a single Hessian-vector product, with explicit analysis of bias under weight sharing and a closed-form variance expression. No quoted step reduces by construction to a fitted input, self-definition, or load-bearing self-citation chain; the central unbiasedness claim is conditioned on an ordering requirement that is stated rather than assumed away. The procedure is self-contained against external stochastic trace methods.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method rests on the standard Hutchinson trace estimator and a domain-specific handling of weight sharing; no new entities are postulated and the only free parameter is the probe count whose range is recommended rather than fitted.

free parameters (1)

probe count K
Recommended interval [5,10] derived from variance balance; not fitted to target data.
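The variance balance behind this recommendation can be sketched with the law of total variance. The notation is introduced here, and reading $K^{\star}$ as the point where the two terms are equal is an assumption about the paper's definition, not a quotation of it. Probes $z_1,\dots,z_K$ are taken as i.i.d. and independent of the mini-batch $B$.

```latex
% K-probe estimate of the layer-l trace on mini-batch B:
\hat{T}_K = \frac{1}{K}\sum_{k=1}^{K} z_{k,l}^{\top}\,(H_B z_k)_l,
\qquad
\mathbb{E}\bigl[\hat{T}_K \mid B\bigr] = \operatorname{tr} H_{B,l}.

% Law of total variance: mini-batch term plus probe term.
\operatorname{Var}\bigl(\hat{T}_K\bigr)
  = \underbrace{\operatorname{Var}_B\bigl(\operatorname{tr} H_{B,l}\bigr)}_{\sigma^2_{\mathrm{batch}}}
  + \frac{1}{K}\,
    \underbrace{\mathbb{E}_B\bigl[\operatorname{Var}_z\bigl(z_l^{\top}(H_B z)_l \mid B\bigr)\bigr]}_{\sigma^2_{\mathrm{probe}}}.

% Assumed balance point: the two terms are equal.
K^{\star} = \frac{\sigma^2_{\mathrm{probe}}}{\sigma^2_{\mathrm{batch}}},
\qquad
\operatorname{Var}\bigl(\hat{T}_{K^{\star}}\bigr) = 2\,\sigma^2_{\mathrm{batch}}.
```

Under this reading, raising $K$ beyond $K^{\star}$ can shrink the total variance by at most a factor of two, because the mini-batch term does not depend on $K$.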

axioms (2)

standard math: Hutchinson stochastic trace estimator is unbiased for the trace of a symmetric matrix
Invoked to justify the core estimation step.
domain assumption: Layer-wise Hessian must be assembled before second differentiation when weights are shared
Stated explicitly as required for unbiasedness under weight sharing.

Reference graph

Works this paper leans on

21 extracted references · 3 canonical work pages · 3 internal anchors

[1] Arpit, D., Jastrzębski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M.S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., Lacoste-Julien, S.: A closer look at memorization in deep networks. In: Proc. of ICML. pp. 233–242 (2017)
[2] Avron, H., Toledo, S.: Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM 58(2), 1–34 (2011)
[3] Becker, S., LeCun, Y.: Improving the convergence of back-propagation learning with second-order methods. In: Proc. of the 1988 Connectionist Models Summer School. vol. 2 (1988)
[4] Bolshim, M.A., Kugaevskikh, A.V.: Inter-layer Hessian as a tool for neural network analysis (2026), preprint; currently under review
[5] Ghorbani, B., Krishnan, S., Xiao, Y.: An investigation into neural net optimization via Hessian eigenvalue density. In: Proc. of ICML. pp. 2232–2241 (2019)
[6] Hawkins, D.M., Olwell, D.H.: Cumulative Sum Charts and Charting for Quality Improvement. Springer Science & Business Media (2012)
[7] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. of IEEE CVPR. pp. 770–778 (2016)
[8] Hutchinson, M.F.: A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics – Simulation and Computation 18(3), 1059–1076 (1989)
[9] Krizhevsky, A.: Learning multiple layers of features from tiny images (2009)
[10] Martens, J.: Deep learning via Hessian-free optimization. In: Proc. of ICML. pp. 735–742 (2010)
[11] Martens, J., Grosse, R.: Optimizing neural networks with Kronecker-factored approximate curvature. In: Proc. of ICML. pp. 2408–2417 (2015)
[12] Meyer, R.A., Musco, C., Musco, C., Woodruff, D.P.: Hutch++: Optimal stochastic trace estimation. In: Proc. of the SIAM Symposium on Simplicity in Algorithms (SOSA). pp. 142–155 (2021)
[13] Montgomery, D.C.: Introduction to Statistical Quality Control. John Wiley & Sons (2020)
[14] Page, E.S.: Continuous inspection schemes. Biometrika 41(1/2), 100–115 (1954)
[15] Papyan, V.: Traces of class/cross-class structure pervade deep learning spectra. Journal of Machine Learning Research 21(252), 1–64 (2020)
[16] Pearlmutter, B.A.: Fast exact multiplication by the Hessian. Neural Computation 6(1), 147–160 (1994)
[17] Sagun, L., Evci, U., Güney, V.U., Dauphin, Y.N., Bottou, L.: Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454 (2017)
[18] Schraudolph, N.N.: Fast curvature matrix-vector products for second-order gradient descent. Neural Computation 14(7), 1723–1738 (2002)
[19] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
[20] Yao, Z., Gholami, A., Keutzer, K., Mahoney, M.W.: PyHessian: Neural networks through the lens of the Hessian. In: Proc. of IEEE Int. Conf. on Big Data. pp. 581–590 (2020)
[21] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016)
