Spectral Perturbation of the Empirical Fisher Information Matrix under Weight Quantization
Pith reviewed 2026-06-30 01:35 UTC · model grok-4.3
The pith
Quantization of model weights strictly increases the dominant eigenvalue of the empirical Fisher information matrix at leading order under a mild genericity condition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Via Weyl's inequality, the largest eigenvalue lambda_max of the empirical Fisher information matrix under additive quantization noise is bounded below by its unperturbed value up to a third-order remainder. Under a mild genericity condition the inequality is strict at leading order, so lambda_max(F_quantized) > lambda_max(F_full) at the first non-vanishing term. A parallel elevation result holds for input departure from a reference ensemble once local curvature monotonicity of lambda_max is assumed. Two tractable approximations to lambda_max are derived, together with a completeness statement for a threshold-based partition of an augmented state space.
What carries the argument
Weyl's inequality applied to the quantization-noise perturbation of the empirical FIM, yielding the directional bound on lambda_max.
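For orientation, the standard form of Weyl's inequality for the largest eigenvalue is written out here. The symbols A and E are generic placeholders for a symmetric matrix and a symmetric perturbation; how the paper decomposes F_quantized - F_full into such a perturbation, and how its third-order remainder arises, is not stated on this page and is not reconstructed here.

```latex
% Weyl's inequality for real symmetric (or Hermitian) matrices A and E of equal size:
\lambda_{\max}(A) + \lambda_{\min}(E) \;\le\; \lambda_{\max}(A + E) \;\le\; \lambda_{\max}(A) + \lambda_{\max}(E).
% Special case: if E is positive semidefinite, then \lambda_{\min}(E) \ge 0, hence
\lambda_{\max}(A + E) \;\ge\; \lambda_{\max}(A).
```

The special case shows why a perturbation whose dominant part is positive semidefinite can only push lambda_max upward, which is the direction of the bound described above.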
If this is right
- The normalized statistic sigma_t = lambda_max(F_t)/lambda_base supplies a runtime monitor whose calibration threshold must be raised for quantized models.
- Two explicit approximations to lambda_max become available for real-time use, one heuristic and one with a rigorous two-sided guarantee.
- A threshold-based partition of the augmented state space is exhaustive once the completeness result holds.
- The observed roughly 244-fold inflation of the calibration threshold on a 4-bit model gains a candidate mechanism in the quantization-induced spectral shift, although the bound does not predict the magnitude of the inflation.
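The monitoring statistic sigma_t = lambda_max(F_t)/lambda_base in the first bullet can be sketched as follows. This is a minimal illustration, not the paper's estimator: it assumes the empirical Fisher is the average of outer products of per-example gradients, and it uses plain power iteration rather than either of the paper's two approximations. The function names `fisher_matvec`, `lambda_max`, and `sigma`, and the way `lambda_base` is supplied, are hypothetical.

```python
import math
import random


def fisher_matvec(grads, v):
    """Return F v for the empirical Fisher F = (1/n) * sum_i g_i g_i^T,
    without forming F explicitly (assumed definition of the empirical FIM)."""
    n = len(grads)
    out = [0.0] * len(v)
    for g in grads:
        coeff = sum(gi * vi for gi, vi in zip(g, v)) / n
        for j, gj in enumerate(g):
            out[j] += coeff * gj
    return out


def lambda_max(grads, iters=200, seed=0):
    """Estimate the largest eigenvalue of the empirical Fisher by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(grads[0]))]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        w = fisher_matvec(grads, v)
        lam = math.sqrt(sum(x * x for x in w))  # ||F v|| approaches lambda_max
        if lam == 0.0:
            return 0.0  # all gradients are zero
        v = [x / lam for x in w]
    return lam


def sigma(grads_t, lambda_base):
    """Normalized statistic sigma_t = lambda_max(F_t) / lambda_base.
    lambda_base is a calibration baseline supplied by the caller (hypothetical)."""
    return lambda_max(grads_t) / lambda_base
```

A caller would compute `lambda_base` once on calibration data and then compare `sigma(grads_t, lambda_base)` against a threshold at run time; per the first bullet, that threshold would need to be set higher for a quantized model.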
Where Pith is reading between the lines
- The same Weyl argument could be applied to other structured perturbations such as pruning masks or low-rank adapters to predict their effect on lambda_max.
- If the genericity condition fails on some layers, the strict elevation may reduce to a weak inequality, suggesting a diagnostic that checks the leading-order term on real weight tensors.
- The monitoring statistic could be tested for early-warning power on out-of-distribution inputs whose departure is known to violate curvature monotonicity.
- Closed-form prediction of the exact inflation magnitude remains open and could be pursued by refining the third-order remainder term.
Load-bearing premise
The local curvature-monotonicity hypothesis on lambda_max is required for the input-departure elevation and the mild genericity condition is required for the strict leading-order increase under quantization.
What would settle it
Compute lambda_max of the empirical FIM on the same trajectories before and after 4-bit quantization of a small transformer; the measured values either confirm or refute the strict elevation predicted by Theorem 4.3.
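The shape of such a before/after comparison can be sketched on a toy stand-in. Everything here is an assumption made for illustration: a two-weight logistic model replaces the small transformer, the quantizer is a symmetric uniform round-to-nearest scheme, the data and labels are random, and the names `quantize`, `empirical_fisher_2d`, `lambda_max_2x2`, and `compare` are hypothetical. The output of this toy neither confirms nor refutes Theorem 4.3.

```python
import math
import random


def quantize(weights, bits=4):
    """Symmetric uniform round-to-nearest quantizer (an assumed scheme)."""
    levels = 2 ** (bits - 1) - 1  # 7 positive levels for 4 bits
    scale = max(abs(w) for w in weights) / levels
    if scale == 0.0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]


def empirical_fisher_2d(weights, data):
    """Entries (a, b, c) of the 2x2 empirical Fisher [[a, b], [b, c]]:
    the mean of g g^T with g = (y - p) * x, the per-example
    log-likelihood gradient of a two-weight logistic model."""
    a = b = c = 0.0
    for x, y in data:
        z = weights[0] * x[0] + weights[1] * x[1]
        p = 1.0 / (1.0 + math.exp(-z))
        g0, g1 = (y - p) * x[0], (y - p) * x[1]
        a += g0 * g0
        b += g0 * g1
        c += g1 * g1
    n = len(data)
    return a / n, b / n, c / n


def lambda_max_2x2(a, b, c):
    """Largest eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    return (a + c) / 2.0 + math.sqrt(((a - c) / 2.0) ** 2 + b * b)


def compare(weights, data, bits=4):
    """lambda_max on the same data before and after weight quantization."""
    lam_full = lambda_max_2x2(*empirical_fisher_2d(weights, data))
    lam_quant = lambda_max_2x2(*empirical_fisher_2d(quantize(weights, bits), data))
    return lam_full, lam_quant


# Toy data: random inputs and random labels, fixed seed for repeatability.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), rng.randint(0, 1)) for _ in range(200)]
lam_full, lam_quant = compare([0.83, -1.37], data)
print(lam_full, lam_quant, lam_quant / lam_full)
```

The point of the design is that both eigenvalues are computed on identical data, so any difference is attributable to the weight perturbation alone. A real test would swap in the transformer's per-example gradients and the deployed quantization scheme.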
Original abstract
We study the spectral perturbation of the empirical Fisher Information Matrix (FIM) of a parametric statistical model under two structured perturbations: departure of the input from a reference (in-distribution) ensemble, and finite-precision (quantized) perturbation of the model's parameters. For the first, under an explicit local curvature-monotonicity hypothesis on the dominant eigenvalue lambda_max of the FIM, we show departure from a reference manifold provably elevates lambda_max relative to a calibration baseline (Proposition 3.2), and discuss why this hypothesis is required, since curvature need not increase monotonically under every perturbation. Our principal result is a directional eigenvalue perturbation bound, via Weyl's inequality, showing lambda_max under a quantization noise perturbation is lower bounded by its unperturbed value up to a third-order remainder, and, under a mild genericity condition, strictly exceeds it at leading order (Theorem 4.3). We give two tractable approximations to lambda_max -- one heuristic, one with a rigorous two-sided bound -- and a completeness result for a threshold-based partition of an augmented state space. These results motivate using sigma_t = lambda_max(F_t)/lambda_base as a runtime monitoring statistic for deployed language models: the quantization result offers a mechanism for an empirical observation of our own, where a calibration threshold for this statistic was approximately 244 times larger than a preliminary full-precision estimate on a 4-bit quantized model, a single measurement rather than a value derived in closed form. We report supporting measurements (twelve models, n=1,080 trajectories) broadly consistent with our predictions, discuss the scope and limitations of every result, and state as an open problem the closed-form prediction of the quantization inflation magnitude our bound does not supply.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies spectral perturbations of the empirical Fisher Information Matrix (FIM) under two structured changes: departure of inputs from a reference ensemble, and finite-precision quantization of model weights. Under an explicit local curvature-monotonicity hypothesis on the dominant eigenvalue lambda_max, input departure is shown to elevate lambda_max relative to a calibration baseline (Proposition 3.2). The principal result applies Weyl's inequality to obtain a directional bound: under a quantization-noise perturbation, lambda_max is bounded below by its unperturbed value up to a third-order remainder and, under a mild genericity condition, strictly exceeds that value at leading order (Theorem 4.3). The authors propose the ratio sigma_t = lambda_max(F_t)/lambda_base as a runtime monitoring statistic for quantized language models, motivated by an empirical 244x inflation of the calibration threshold observed on a 4-bit model, and report supporting measurements across twelve models and 1,080 trajectories that are broadly consistent with the predictions.
Significance. If the central bounds hold, the work supplies a rigorous perturbation analysis, grounded in Weyl's inequality, for the dominant eigenvalue of the empirical FIM under quantization, an operationally relevant setting for deployed models. The explicit statement and discussion of the curvature-monotonicity hypothesis, together with the acknowledgment that the exact inflation magnitude remains an open problem, add transparency. The empirical consistency across multiple models provides practical corroboration for the proposed monitoring statistic, though the result is primarily theoretical.
major comments (2)
- Proposition 3.2: the elevation claim for input departure rests entirely on the local curvature-monotonicity hypothesis; while the manuscript correctly notes that the hypothesis is not universal, the scope of perturbations and model classes for which it holds is not further delimited, leaving the practical reach of the input-departure result dependent on an assumption whose verification is external to the derivation.
- Theorem 4.3: the third-order remainder in the Weyl-based bound prevents any closed-form prediction of the observed inflation magnitude (explicitly stated as an open problem in the abstract); because the 244x factor is presented only as a measurement rather than a consequence of the leading-order term, the theorem establishes the directional inequality but does not quantitatively account for the scale of the empirical effect that motivates the monitoring statistic.
minor comments (2)
- The experimental section (referenced via the 1,080 trajectories) does not appear to tabulate the precise model architectures, quantization bit-widths, or task distributions; adding a compact table or explicit cross-reference would allow readers to assess the breadth of the consistency claim.
- Notation for the monitoring statistic sigma_t and the baseline lambda_base is introduced late; defining these quantities with a forward reference in the introduction would improve readability for readers primarily interested in the application.
Simulated Author's Rebuttal
We thank the referee for the careful reading and the recommendation of minor revision. We address the two major comments point by point below.
- Referee (Proposition 3.2): the elevation claim for input departure rests entirely on the local curvature-monotonicity hypothesis; while the manuscript correctly notes that the hypothesis is not universal, the scope of perturbations and model classes for which it holds is not further delimited, leaving the practical reach of the input-departure result dependent on an assumption whose verification is external to the derivation.
  Authors: We agree that the local curvature-monotonicity hypothesis is essential to Proposition 3.2 and that its domain of applicability is not exhaustively mapped. In the revised manuscript we will insert a short additional paragraph after the proposition that supplies concrete examples (linear models, small-norm input shifts in overparameterized networks) where the hypothesis can be verified directly, while reiterating that verification remains external for arbitrary models.
  Revision: partial
- Referee (Theorem 4.3): the third-order remainder in the Weyl-based bound prevents any closed-form prediction of the observed inflation magnitude (explicitly stated as an open problem in the abstract); because the 244x factor is presented only as a measurement rather than a consequence of the leading-order term, the theorem establishes the directional inequality but does not quantitatively account for the scale of the empirical effect that motivates the monitoring statistic.
  Authors: The observation is correct: Theorem 4.3 supplies only a directional bound whose remainder precludes a closed-form magnitude, and the 244x factor is presented strictly as an empirical observation. The manuscript already states the magnitude prediction as an open problem. No revision is required.
  Revision: no
Circularity Check
No significant circularity
The central bound (Theorem 4.3) is obtained by direct application of Weyl's inequality to the quantization perturbation of the empirical FIM, producing a lower bound on lambda_max with an explicit third-order remainder and a strict inequality under a stated mild genericity condition. Proposition 3.2 rests on an explicitly declared local curvature-monotonicity hypothesis whose necessity is discussed in the text. The 244x factor is presented as an empirical measurement on a quantized model, not as a closed-form prediction derived from the bound. No self-definitional reductions, fitted inputs renamed as predictions, load-bearing self-citations, or ansatzes smuggled via prior work appear in the derivation chain. The argument is self-contained against the external Weyl inequality and the paper's own stated assumptions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: local curvature-monotonicity hypothesis on lambda_max of the FIM
- domain assumption: mild genericity condition