Inter-Layer Hessian Analysis of Neural Networks with DAG Architectures

Alexander Kugaevskikh (1) ((1) ITMO University; Maxim Bolshim (1); Russia); Saint Petersburg

arxiv: 2604.11639 · v1 · submitted 2026-04-13 · 💻 cs.LG

Inter-Layer Hessian Analysis of Neural Networks with DAG Architectures

Maxim Bolshim (1) , Alexander Kugaevskikh (1) ((1) ITMO University , Saint Petersburg , Russia) This is my paper

Pith reviewed 2026-05-10 15:05 UTC · model grok-4.3

classification 💻 cs.LG

keywords Hessian decompositionGauss-Newton approximationinter-layer curvatureDAG neural architecturesReLU convexitysaddle-point analysisoptimization landscapeskip-connection resonance

0 comments

The pith

The loss Hessian in arbitrary DAG neural networks decomposes into a convex Gauss-Newton part plus a tensor part that vanishes almost everywhere for ReLU activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper supplies an exact block decomposition of the full Hessian that indexes curvature interactions by the edges of any DAG architecture. It separates the Gauss-Newton term, which is always positive semi-definite, from a residual tensor term that encodes the non-convexity responsible for saddle points. For piecewise-linear activations the tensor term disappears in the input Hessian, so the curvature seen by the inputs is entirely convex. The same decomposition yields four cheap diagnostic quantities—inter-layer resonance, geometric coupling, stable rank, and GN-Gap—that can be estimated from gradients alone and that explain why resonance decays in plain networks yet survives skip connections. When the DAG collapses to a single node the definitions recover the ordinary Hessian.

Core claim

An arbitrary neural-network loss admits the canonical splitting H = H^{GN} + H^T, where H^{GN} collects the convex Gauss-Newton contributions and H^T isolates the tensor curvature that creates saddle points. For ReLU activations the input Hessian satisfies H^T_{v,w} ≡ 0 almost everywhere, so H^f_{v,w} = H^{GN}_{v,w} ≽ 0; the parametric Hessian still retains residual tensor terms. The block structure is indexed by the DAG edges, and the resulting metrics (resonance R, coupling C, stable rank D, GN-Gap) are computable in linear time and reveal exponential decay of resonance in vanilla networks together with its preservation under skip connections.

What carries the argument

The canonical block decomposition H = H^{GN} + H^T indexed by DAG edges, which isolates the convex Gauss-Newton component from the tensor residual that vanishes for ReLU input Hessians.

If this is right

The input Hessian of a ReLU network is positive semi-definite almost everywhere.
Inter-layer resonance decays exponentially in feed-forward networks but remains stable when skip connections are present.
The four diagnostic metrics can be estimated from stochastic gradients in O(P) time even for ResNet-scale models.
When the architecture is a single node the entire formalism reduces to the classical Hessian.
Structural curvature between layers can be diagnosed without ever forming the full p-by-p matrix.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The decomposition may let practitioners monitor and steer the location of saddle points by inserting or removing skip connections.
The resonance metric could serve as an online diagnostic for choosing layer widths or activation types during architecture search.
Extending the same block analysis to other piecewise-linear activations would show how the width of the linear regions controls the size of the convex region.
The GN-Gap quantity might correlate with the number of effective negative-curvature directions encountered during training.

Load-bearing premise

Any neural-network architecture admits an exact block decomposition of its Hessian indexed by the DAG edges, with the tensor term vanishing almost everywhere for ReLU activations under standard twice-differentiability.

What would settle it

A non-zero tensor component observed in the input Hessian of a ReLU network at a twice-differentiable point, or failure of the single-node case to recover the ordinary Hessian matrix.

Figures

Figures reproduced from arXiv: 2604.11639 by Alexander Kugaevskikh (1) ((1) ITMO University, Maxim Bolshim (1), Russia), Saint Petersburg.

**Figure 2.** Figure 2: Exp. 1: Geometric coupling C¯(d) vs. distance d (error bars: ±1σ over 5 seeds). (a) L= 8, (b) L= 12. Plain MLP: C decays monotonically from 1 to 0.24 (L= 12, init), reflecting loss of geometric coherence between distant layers. Residual MLP: C >0.93 at all distances; skip connections preserve coupling. After training both architectures shift upward (curvature becomes more uniform). 0 1 2 3 4 5 6 7 10−1 100… view at source ↗

**Figure 3.** Figure 3: Exp. 1: Mean resonance R¯(d) vs. distance d between layers (log scale on y; error bars ±1σ over 5 seeds). (a) L= 8, (b) L = 12. Plain MLP exhibits exponential decay (straight lines on log scale, R2 > 0.91); Residual MLP shows stabilization (b≈0.02). Scale differs: init ∼ 10−1 , final ∼ 100 (reflects overall curvature growth during training). 14 [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

**Figure 4.** Figure 4: Exp. 2: Stable rank Dfar vs. bottleneck width du (CIFAR-100). Solid/dashed: init/final; bright: L= 6, faded: L= 8. L= 8: Dinit far is lower for du ≥64 (∼8–20 %), consistent with additional rank compression at greater distance. Dotted: theoretical bound min(du, K−1). du = 512 ( † , control: dbase = 512, no bottleneck) [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Exp. 3: GN-Gap (init) vs. E[σ ′′(z) 2 ] for 5 activations (isolation protocol, LayerNorm off). Linear fit R2 = 0.908; Spearman rank correlation ρs = 0.97 (p<0.01). With n= 5, R2 has limited power, but monotonicity of ranking (ρs ≈1) robustly confirms the scaling HT ∼σ ′′ 2 . zero. To demonstrate the non-triviality of formula equation 2 on DAG topologies, we introduce the minimal Diamond MLP construction (E… view at source ↗

**Figure 6.** Figure 6: Exp. 4: GN-Gap at the merge node of Diamond MLP ( [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Exp. 6: Mean resonance R¯(d) for ResNet-18 (CIFAR-10, log-scale on y; ±1σ over 5 seeds). (a) ReLU: ResNet preserves R¯ at init (R(0)/R(4) ≈ 1); Plain decays by ∼627×. (b) SiLU: decay is stronger—Plain init 47 000×; ResNet 25×, slower than Plain (skip connections stabilize curvature) [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

read the original abstract

Modern automatic differentiation frameworks (JAX, PyTorch) return the Hessian of the loss function as a monolithic tensor, without exposing the internal structure of inter-layer interactions. This paper presents an analytical formalism that explicitly decomposes the full Hessian into blocks indexed by the DAG of an arbitrary architecture. The canonical decomposition $H = H^{GN} + H^T$ separates the Gauss--Newton component (convex part) from the tensor component (residual curvature responsible for saddle points). For piecewise-linear activations (ReLU), the tensor component of the input Hessian vanishes ($H^{T}_{v,w}\!\equiv\!0$ a.e., $H^f_{v,w}\!=\!H^{GN}_{v,w}\!\succeq\!0$); the full parametric Hessian contains residual terms that do not reduce to the GGN. Building on this decomposition, we introduce diagnostic metrics (inter-layer resonance~$\mathcal{R}$, geometric coupling~$\mathcal{C}$, stable rank~$\mathcal{D}$, GN-Gap) that are estimated stochastically in $O(P)$ time and reveal structural curvature interactions between layers. The theoretical analysis explains exponential decay of resonance in vanilla networks and its preservation under skip connections; empirical validation spans fully connected MLPs (Exp.\,1--5) and convolutional architectures (ResNet-18, ${\sim}11$M~parameters, Exp.\,6). When the architecture reduces to a single node, all definitions collapse to the standard Hessian $\nabla^2_\theta\mathcal{L}(\theta)\in\mathbb{R}^{p\times p}$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a DAG-indexed block decomposition of the Hessian plus four stochastic metrics for inter-layer curvature, but the abstract leaves the derivations thin.

read the letter

The main thing is a structured decomposition of the full Hessian into blocks indexed by the edges of the network DAG, splitting it into a Gauss-Newton part and a residual tensor part, with the tensor term vanishing almost everywhere for ReLU input Hessians. They build four cheap stochastic diagnostics on top of that: inter-layer resonance, geometric coupling, stable rank, and GN-Gap. These are the genuinely new pieces; the decomposition itself follows from standard chain-rule expansions on the computation graph, but the specific quartet of metrics and the way they are estimated in linear time in the parameter count do not appear in the earlier Hessian literature they cite. The write-up does a clean job showing how the resonance metric decays exponentially in plain MLPs yet stays stable under skip connections, and they run the diagnostics on both small fully-connected nets and a full ResNet-18, which is reasonable coverage for a methods paper. The ReLU vanishing claim is consistent with what is already known about piecewise-linear activations and the GGN decomposition, so that part holds up without contradiction. The soft spots are mostly in the presentation. The abstract states the formalism and the vanishing result but gives no derivation steps, no explicit list of assumptions, and no error bounds, so it is difficult to verify whether the block decomposition is exact for every DAG or only under additional smoothness conditions. The empirical section is described at a high level without numbers, baselines, or ablation details, which makes it hard to judge how much new signal the metrics actually provide over existing curvature probes. This is aimed at researchers who already work on loss-landscape geometry, second-order optimization, or architecture analysis and want practical tools to inspect layer-to-layer interactions. A reader in that niche would get concrete diagnostic quantities they could implement and test. It deserves a serious referee because the core idea is coherent, the metrics are well-defined and cheap, and the ReLU observation is a useful clarification even if it is not entirely surprising. Send it for review, but ask the authors to expand the math section with full derivations and to add quantitative experimental results with direct comparisons to standard Hessian approximations.

Referee Report

2 major / 2 minor

Summary. The manuscript develops an analytical formalism that decomposes the Hessian of the loss for neural networks whose architectures are directed acyclic graphs (DAGs) into inter-layer blocks. It introduces the canonical splitting H = H^{GN} + H^T that isolates the Gauss-Newton (convex) term from the residual tensor curvature, proves that the tensor term vanishes almost everywhere for the input Hessian under ReLU activations (while residual terms remain in the full parametric Hessian), and defines four stochastic diagnostic metrics (inter-layer resonance R, geometric coupling C, stable rank D, GN-Gap) that can be estimated in O(P) time. The formalism is shown to recover the ordinary Hessian when the DAG reduces to a single node; theoretical consequences include exponential decay of resonance in feed-forward nets and its preservation under skip connections. Empirical support is provided on fully-connected MLPs (Experiments 1-5) and ResNet-18 (Experiment 6).

Significance. If the block decomposition and the ReLU vanishing result are rigorously established, the work supplies a structured, architecture-aware view of curvature that could clarify optimization dynamics and the role of skip connections. The linear-time stochastic estimators for the new metrics are practically attractive for models with millions of parameters. The consistency check that all quantities collapse to the standard Hessian on a single-node DAG is a useful sanity property.

major comments (2)

[§3] §3 (Theoretical Analysis), the statement of the canonical decomposition H = H^{GN} + H^T and the subsequent claim that H^T_{v,w} ≡ 0 a.e. for ReLU input Hessians: no derivation steps, explicit twice-differentiability assumptions, or error-bound analysis are supplied, yet this vanishing result is load-bearing for all later claims about input-Hessian positivity and the distinction between input and parametric Hessians.
[Experiments] Experiments section, Exp. 6 (ResNet-18, ~11 M parameters): the reported structural curvature interactions are described qualitatively without quantitative values, baselines, or statistical significance tests for the metrics R, C, D, and GN-Gap, which weakens the empirical support for the inter-layer resonance preservation claim under skip connections.

minor comments (2)

[Abstract] Abstract: the symbols H^f_{v,w} and H^{GN}_{v,w} appear without prior definition or reference to the block-indexing scheme.
[Notation] Notation section: the precise stochastic estimators for R, C, D, and GN-Gap are introduced only by name; the O(P) complexity claim would be clearer if the sampling procedure were stated explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which have helped us identify areas where the manuscript can be strengthened. We respond to each major comment below and commit to revisions that address the concerns raised.

read point-by-point responses

Referee: [§3] §3 (Theoretical Analysis), the statement of the canonical decomposition H = H^{GN} + H^T and the subsequent claim that H^T_{v,w} ≡ 0 a.e. for ReLU input Hessians: no derivation steps, explicit twice-differentiability assumptions, or error-bound analysis are supplied, yet this vanishing result is load-bearing for all later claims about input-Hessian positivity and the distinction between input and parametric Hessians.

Authors: We acknowledge that the main text of §3 presents the canonical decomposition H = H^{GN} + H^T and the ReLU vanishing result without sufficient intermediate derivation steps or explicit discussion of differentiability. In the revised manuscript we will insert a self-contained subsection that derives the block decomposition from the chain rule applied to the DAG structure, states the twice-differentiability assumptions (ReLU is twice differentiable almost everywhere, with the non-differentiable set of measure zero justifying the a.e. claim), and supplies a short error-bound argument showing that the tensor term vanishes in the input-Hessian case while remaining in the parametric Hessian. These additions will make the load-bearing theoretical claims fully traceable. revision: yes
Referee: [Experiments] Experiments section, Exp. 6 (ResNet-18, ~11 M parameters): the reported structural curvature interactions are described qualitatively without quantitative values, baselines, or statistical significance tests for the metrics R, C, D, and GN-Gap, which weakens the empirical support for the inter-layer resonance preservation claim under skip connections.

Authors: We agree that the current qualitative description of Experiment 6 limits the strength of the empirical claims. In the revision we will replace the qualitative summary with tables reporting the numerical values of R, C, D, and GN-Gap for ResNet-18, include a baseline comparison against an equivalent fully-connected network of similar depth and width, and add results aggregated over five independent runs together with standard deviations and two-sided t-test p-values against the null hypothesis of no resonance preservation. These quantitative elements will directly support the theoretical prediction that skip connections maintain inter-layer resonance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper derives the block decomposition H = H^{GN} + H^T directly from standard chain-rule expansions on the DAG under twice-differentiability, with the ReLU vanishing of H^T following immediately from the second derivative being zero a.e. The introduced metrics (resonance R, coupling C, etc.) are defined from the resulting blocks rather than fitted or predicted from data. No self-citations, ansatzes, or uniqueness theorems are invoked to support the central claims; the single-node reduction is a consistency check that recovers the known Hessian. The formalism is therefore independent of its own outputs and does not reduce by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 4 invented entities

The central claim rests on standard calculus (twice differentiability) and the modeling assumption that networks are DAGs; the four diagnostic metrics are newly defined quantities whose independent predictive power is not demonstrated outside the paper.

axioms (2)

domain assumption Neural network computation graphs are directed acyclic graphs (DAGs).
Used to index the Hessian blocks by layer pairs.
standard math The loss is twice differentiable almost everywhere.
Required for the existence of the Hessian and its decomposition.

invented entities (4)

Inter-layer resonance R no independent evidence
purpose: Scalar diagnostic of curvature interaction between layers
Newly introduced metric estimated stochastically.
Geometric coupling C no independent evidence
purpose: Scalar diagnostic of curvature interaction between layers
Newly introduced metric estimated stochastically.
Stable rank D no independent evidence
purpose: Scalar diagnostic of curvature interaction between layers
Newly introduced metric estimated stochastically.
GN-Gap no independent evidence
purpose: Scalar diagnostic measuring difference between full and GN Hessian
Newly introduced metric estimated stochastically.

pith-pipeline@v0.9.0 · 5595 in / 1567 out tokens · 95705 ms · 2026-05-10T15:05:02.253242+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Eachf v is locally Lipschitz. 25

work page
[2]

]⊤ intersects non-smoothness hyperplanes transversally), the set of non-smooth points has Lebesgue measure zero

Under a transversality condition on the activation maps (the mappingx7→[f u1(x), . . .]⊤ intersects non-smoothness hyperplanes transversally), the set of non-smooth points has Lebesgue measure zero

work page
[3]

At points of differentiability, the AD-Hessian coincides with the classical one

work page
[4]

geometric connectivity

At non-smooth points, AD frameworks return an AD-Hessian (withT≡0), which amounts to select- ing the zero element from the generalized derivative and preserves algorithmic convergence by CSVF theory (Bolte and Pauwels, 2021). Definition 47(AD-Hessian at a non-smooth point).For ReLU networks, AD frameworks setσ′′(0) := 0, so thatTu;v ≡0. The input Hessian ...

work page 2021
[5]

The Gauss–Newton componentHGN is always positive semi-definite:m−(H GN) = 0

work page
[6]

The full HessianH f ull =H GN +H T may havem −(H f ull)>0, indicating a saddle point

work page
[7]

34 Proof.Item 1:H GN is a sum of matrices of the formD⊤ c←vH L outDc←w, whereH L out ⪰0for convexL(see proof of Theorem 7)

Negative curvature arises exclusively from the tensor componentHT. 34 Proof.Item 1:H GN is a sum of matrices of the formD⊤ c←vH L outDc←w, whereH L out ⪰0for convexL(see proof of Theorem 7). Item 2: the tensor component contains termsP i Tu;v,wδu,i whereδ u,i can be negative (e.g. under mis- classification) and the tensorsTneed not be PSD. Item 3: the ten...

work page 2014
[8]

The time complexity of computing the full Hessian isO(nsd3 +nsd 2P+P 2)in the general case with dense tensors

work page
[9]

For networks with element-wise activation functions (ReLU, sigmoid), where tensorsTu;v andT u;v,w are diagonal or sparse withO(d)cost, the total time reduces toO(nsd2 +nsdP+P 2)

work page
[10]

The space complexity of storing the full Hessian isO(P2)

work page
[11]

For a fully connected DAG (s=O(n)) the time complexity isO(n2d3 +n 2d2P+P 2). Proof.1. Computing the input HessianH f v,w: By formula equation 2, for each pair(v, w): compute JacobiansD u1←v andD u2←w for all(u 1, u2)∈ Ch(v)×Ch(w):O(s 2d2)per pair; compute blocksH f u1,u2 (recursively) and multiplyD ⊤ u1←vH f u1,u2 Du2←w: O(s2d3); compute mixed-derivative...

work page
[12]

IfT u;v are diagonal (element-wise activations), the cost reduces toO(nsdP+P 2)

Computing the parametric HessianHθv,θw:Total for all pairs:O(nsdP). IfT u;v are diagonal (element-wise activations), the cost reduces toO(nsdP+P 2)

work page
[13]

Total:O(nsd 3 +nsd 2P+P 2)

Assembly:O(P 2)for placing blocks. Total:O(nsd 3 +nsd 2P+P 2). Corollary 90(Summary of computational costs).Letn=|V|,Pthe total number of parameters,d= maxv dv,s= max v |Pa(v)∪Ch(v)|. (1)MetricsR,C,D:the blockH f v,w ∈R dv×dw is computed without building the fullP×PHessian; when even the block cannot be stored, the Hutchinson stochastic estima- tor (Avron...

work page 2011

[1] [1]

Eachf v is locally Lipschitz. 25

work page

[2] [2]

]⊤ intersects non-smoothness hyperplanes transversally), the set of non-smooth points has Lebesgue measure zero

Under a transversality condition on the activation maps (the mappingx7→[f u1(x), . . .]⊤ intersects non-smoothness hyperplanes transversally), the set of non-smooth points has Lebesgue measure zero

work page

[3] [3]

At points of differentiability, the AD-Hessian coincides with the classical one

work page

[4] [4]

geometric connectivity

At non-smooth points, AD frameworks return an AD-Hessian (withT≡0), which amounts to select- ing the zero element from the generalized derivative and preserves algorithmic convergence by CSVF theory (Bolte and Pauwels, 2021). Definition 47(AD-Hessian at a non-smooth point).For ReLU networks, AD frameworks setσ′′(0) := 0, so thatTu;v ≡0. The input Hessian ...

work page 2021

[5] [5]

The Gauss–Newton componentHGN is always positive semi-definite:m−(H GN) = 0

work page

[6] [6]

The full HessianH f ull =H GN +H T may havem −(H f ull)>0, indicating a saddle point

work page

[7] [7]

34 Proof.Item 1:H GN is a sum of matrices of the formD⊤ c←vH L outDc←w, whereH L out ⪰0for convexL(see proof of Theorem 7)

Negative curvature arises exclusively from the tensor componentHT. 34 Proof.Item 1:H GN is a sum of matrices of the formD⊤ c←vH L outDc←w, whereH L out ⪰0for convexL(see proof of Theorem 7). Item 2: the tensor component contains termsP i Tu;v,wδu,i whereδ u,i can be negative (e.g. under mis- classification) and the tensorsTneed not be PSD. Item 3: the ten...

work page 2014

[8] [8]

The time complexity of computing the full Hessian isO(nsd3 +nsd 2P+P 2)in the general case with dense tensors

work page

[9] [9]

For networks with element-wise activation functions (ReLU, sigmoid), where tensorsTu;v andT u;v,w are diagonal or sparse withO(d)cost, the total time reduces toO(nsd2 +nsdP+P 2)

work page

[10] [10]

The space complexity of storing the full Hessian isO(P2)

work page

[11] [11]

For a fully connected DAG (s=O(n)) the time complexity isO(n2d3 +n 2d2P+P 2). Proof.1. Computing the input HessianH f v,w: By formula equation 2, for each pair(v, w): compute JacobiansD u1←v andD u2←w for all(u 1, u2)∈ Ch(v)×Ch(w):O(s 2d2)per pair; compute blocksH f u1,u2 (recursively) and multiplyD ⊤ u1←vH f u1,u2 Du2←w: O(s2d3); compute mixed-derivative...

work page

[12] [12]

IfT u;v are diagonal (element-wise activations), the cost reduces toO(nsdP+P 2)

Computing the parametric HessianHθv,θw:Total for all pairs:O(nsdP). IfT u;v are diagonal (element-wise activations), the cost reduces toO(nsdP+P 2)

work page

[13] [13]

Total:O(nsd 3 +nsd 2P+P 2)

Assembly:O(P 2)for placing blocks. Total:O(nsd 3 +nsd 2P+P 2). Corollary 90(Summary of computational costs).Letn=|V|,Pthe total number of parameters,d= maxv dv,s= max v |Pa(v)∪Ch(v)|. (1)MetricsR,C,D:the blockH f v,w ∈R dv×dw is computed without building the fullP×PHessian; when even the block cannot be stored, the Hutchinson stochastic estima- tor (Avron...

work page 2011