arxiv: 2512.12911 · v2 · submitted 2025-12-15 · 📊 stat.ML · cs.LG

Evaluating Singular Value Thresholds for DNN Weight Matrices based on Random Matrix Theory

Pith reviewed 2026-05-16 22:40 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords: singular value decomposition · random matrix theory · deep neural network weights · low-rank approximation · cosine similarity · threshold selection · noise removal

The pith

A cosine similarity metric between singular vectors can check whether random matrix theory thresholds correctly separate signal from noise in DNN weight matrices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats each DNN weight matrix as the sum of a low-rank signal part and a noise part whose singular values follow random matrix theory. It removes the noise singular values using an RMT-derived threshold to produce a low-rank approximation. To decide whether the chosen threshold is good, the authors introduce a metric that measures the cosine similarity between the leading singular vectors of the recovered signal and the original matrix. Numerical tests on real networks then compare two different ways of estimating the threshold with this new similarity score.
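
The thresholding step described above can be sketched on synthetic data. This is a minimal illustration, not the paper's procedure: the noise level `sigma` is assumed known, and the threshold is the asymptotic largest singular value of an n × m matrix with i.i.d. entries of variance sigma², namely sigma(√n + √m). The paper instead estimates the boundary from the weights themselves, which is not reproduced here. The helper names `noise_edge` and `truncate_svd` are hypothetical.

```python
import numpy as np


def noise_edge(n_rows, n_cols, sigma):
    """Asymptotic largest singular value of an i.i.d. noise matrix (assumed noise model)."""
    return sigma * (np.sqrt(n_rows) + np.sqrt(n_cols))


def truncate_svd(W, threshold):
    """Drop singular values at or below `threshold`; return the approximation and its rank."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    keep = s > threshold
    W_hat = (U[:, keep] * s[keep]) @ Vt[keep, :]
    return W_hat, int(keep.sum())


# Synthetic stand-in for a weight matrix: planted rank-3 signal plus Gaussian noise.
rng = np.random.default_rng(0)
n, m, sigma = 300, 200, 0.05
U0, _ = np.linalg.qr(rng.standard_normal((n, 3)))
V0, _ = np.linalg.qr(rng.standard_normal((m, 3)))
S = (U0 * np.array([10.0, 8.0, 6.0])) @ V0.T
W = S + sigma * rng.standard_normal((n, m))

W_hat, s_hat = truncate_svd(W, noise_edge(n, m, sigma))
print("estimated number of signal singular values:", s_hat)
```

Because the planted singular values sit far above the noise edge, the retained rank should stay close to the planted rank; with weaker signals the count becomes sensitive to how the threshold is estimated, which is the situation the paper's metric is meant to diagnose.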

Core claim

The paper shows that the cosine similarity between the singular vectors of the estimated signal matrix and the original weight matrix serves as a practical indicator of whether an RMT-based threshold has removed the right singular values, allowing direct comparison of threshold selection methods on actual DNN weights.

What carries the argument

The cosine similarity between singular vectors of the signal and original weight matrices, which quantifies how well an RMT threshold preserves the dominant directions of the weight matrix.
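
To make the quantity concrete, here is a minimal sketch of per-direction cosine similarities between singular vectors, on synthetic data where the signal matrix is known by construction. For trained weights the signal is not observable, so the paper's metric has to rely on an estimate; its exact definition is not given on this page and is not reproduced here. The function name `singular_vector_cosines` and all numerical settings are assumptions.

```python
import numpy as np


def singular_vector_cosines(S, W, k):
    """Absolute cosines between the k leading left/right singular vectors of S and W."""
    Us, _, Vts = np.linalg.svd(S, full_matrices=False)
    Uw, _, Vtw = np.linalg.svd(W, full_matrices=False)
    # Singular vectors are defined up to sign, so compare absolute inner products.
    cos_left = np.abs(np.sum(Us[:, :k] * Uw[:, :k], axis=0))
    cos_right = np.abs(np.sum(Vts[:k, :] * Vtw[:k, :], axis=1))
    return cos_left, cos_right


# Planted rank-3 signal with distinct singular values, plus i.i.d. Gaussian noise.
rng = np.random.default_rng(0)
n, m, sigma = 300, 200, 0.05
U0, _ = np.linalg.qr(rng.standard_normal((n, 3)))
V0, _ = np.linalg.qr(rng.standard_normal((m, 3)))
S = (U0 * np.array([10.0, 8.0, 6.0])) @ V0.T
W = S + sigma * rng.standard_normal((n, m))

cos_left, cos_right = singular_vector_cosines(S, W, k=3)
print("left cosines:", np.round(cos_left, 3))
print("right cosines:", np.round(cos_right, 3))
```

Values near 1 indicate that a direction of the noisy matrix is dominated by signal; directions whose singular values fall inside the noise bulk would be expected to show much weaker alignment.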

If this is right

  • Thresholds chosen by the better of the two RMT estimators will produce low-rank weight approximations whose dominant singular vectors remain aligned with the original matrix.
  • The same similarity check can be applied to decide how many singular values to keep in any SVD-based compression of a trained network.
  • Networks whose weight matrices pass the similarity test after thresholding are expected to retain more of their original functional behavior than those that do not.
  • The metric supplies an objective score for ranking different RMT threshold formulas without requiring retraining or validation accuracy.
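
The idea of scoring each candidate rank, as in the bullets above, can be illustrated with a sweep over the number of retained singular values. The score used here is a hypothetical stand-in (the Frobenius cosine between the truncated matrix and a planted synthetic signal), not the paper's metric, and it is only computable because the signal is known by construction.

```python
import numpy as np


def frobenius_cosine(A, B):
    """Cosine of the angle between two matrices under the Frobenius inner product."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))


def sweep_ranks(W, S, max_rank):
    """Score the rank-k truncation of W against S for k = 1..max_rank."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    scores = []
    for k in range(1, max_rank + 1):
        W_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
        scores.append(frobenius_cosine(W_k, S))
    return np.array(scores)


# Planted rank-3 signal plus i.i.d. Gaussian noise (all settings are illustrative).
rng = np.random.default_rng(0)
n, m, sigma = 300, 200, 0.05
U0, _ = np.linalg.qr(rng.standard_normal((n, 3)))
V0, _ = np.linalg.qr(rng.standard_normal((m, 3)))
S = (U0 * np.array([10.0, 8.0, 6.0])) @ V0.T
W = S + sigma * rng.standard_normal((n, m))

scores = sweep_ranks(W, S, max_rank=10)
best_k = int(np.argmax(scores)) + 1
print("rank with the highest score:", best_k)
```

Keeping too few singular values discards signal, while keeping too many adds noise energy without adding alignment, so a score of this kind tends to peak near the planted rank. A threshold estimator can then be judged by how close its chosen rank lies to that peak.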

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The similarity test could be used at training time to decide when to switch from full-rank to low-rank layers.
  • It may generalize to other matrix factorizations such as non-negative matrix factorization when the goal is to keep the dominant directions intact.
  • Repeated application across layers could reveal which layers in a network are most sensitive to singular-value truncation.

Load-bearing premise

DNN weight matrices can be modeled as the sum of a low-rank signal matrix and additive noise whose singular value distribution matches random matrix theory predictions.
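
In symbols, the premise can be written as follows. The notation (n, m, s, σ, q) is chosen here for illustration and need not match the paper's; the density shown is the standard Marchenko-Pastur law for noise with i.i.d. entries of finite variance.

```latex
W = S + N, \qquad W \in \mathbb{R}^{n \times m}, \qquad \operatorname{rank}(S) = s \ll \min(n, m).

% Assume N has i.i.d. entries with mean 0 and variance \sigma^2, and let q = n/m \le 1.
% The eigenvalues \lambda of \tfrac{1}{m} N N^{\top} then have the limiting density
\rho_{\mathrm{MP}}(\lambda)
  = \frac{\sqrt{(\lambda_{+} - \lambda)(\lambda - \lambda_{-})}}{2 \pi \sigma^{2} q \, \lambda},
\qquad \lambda_{\pm} = \sigma^{2} \bigl(1 \pm \sqrt{q}\bigr)^{2},
\qquad \lambda \in [\lambda_{-}, \lambda_{+}].

% Equivalently, the singular values of N concentrate below the edge
\nu_{+} = \sqrt{m \, \lambda_{+}} = \sigma \bigl(\sqrt{n} + \sqrt{m}\bigr).
```

Under this model, singular values of W above the edge are attributed to S and those below it to N, which is what makes a threshold at or near the edge meaningful.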

What would settle it

Run the proposed cosine similarity test on a trained DNN weight matrix; if the similarity stays low even after applying the RMT threshold while the singular value histogram clearly deviates from the Marchenko-Pastur law, the evaluation approach fails.
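
One way to quantify "clearly deviates from the Marchenko-Pastur law" is a Kolmogorov-Smirnov distance between the empirical eigenvalue distribution and the MP law. The sketch below is an illustration under stated assumptions, not something taken from the paper: it uses synthetic matrices, a known noise variance, an aspect ratio q = n/m < 1, squared singular values divided by m as the eigenvalues, and the hypothetical helper names `mp_cdf` and `ks_distance`.

```python
import numpy as np


def mp_cdf(x, q, sigma2=1.0, grid=20001):
    """Numerically integrated Marchenko-Pastur CDF for aspect ratio 0 < q < 1."""
    lo = sigma2 * (1.0 - np.sqrt(q)) ** 2
    hi = sigma2 * (1.0 + np.sqrt(q)) ** 2
    lam = np.linspace(lo, hi, grid)
    dens = np.sqrt(np.maximum((hi - lam) * (lam - lo), 0.0)) / (2.0 * np.pi * sigma2 * q * lam)
    steps = 0.5 * (dens[1:] + dens[:-1]) * np.diff(lam)  # trapezoid rule
    cdf = np.concatenate(([0.0], np.cumsum(steps)))
    cdf /= cdf[-1]  # absorb the small integration error
    return np.interp(x, lam, cdf, left=0.0, right=1.0)


def ks_distance(samples, cdf):
    """Largest gap between the empirical CDF of `samples` and the reference `cdf`."""
    x = np.sort(np.asarray(samples))
    k = x.size
    F = cdf(x)
    upper = np.arange(1, k + 1) / k - F
    lower = F - np.arange(0, k) / k
    return float(max(upper.max(), lower.max()))


rng = np.random.default_rng(0)
n, m = 200, 400
q = n / m
Z = rng.standard_normal((n, m))

# Case 1: i.i.d. unit-variance noise, which the MP law is designed to describe.
eig_iid = np.linalg.svd(Z, compute_uv=False) ** 2 / m
# Case 2: rows with unequal variances (average still 1), which violates the i.i.d. premise.
row_var = np.concatenate((np.full(n // 2, 0.2), np.full(n // 2, 1.8)))
eig_het = np.linalg.svd(np.sqrt(row_var)[:, None] * Z, compute_uv=False) ** 2 / m

d_iid = ks_distance(eig_iid, lambda x: mp_cdf(x, q))
d_het = ks_distance(eig_het, lambda x: mp_cdf(x, q))
print(f"KS distance, i.i.d. noise: {d_iid:.3f}")
print(f"KS distance, unequal row variances: {d_het:.3f}")
```

For a trained weight matrix the variance and the set of bulk singular values would both have to be estimated first, so a distance of this kind is best read as a rough diagnostic rather than a formal test.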

Figures

Figures reproduced from arXiv: 2512.12911 by Hiroki Hashiguchi, Kohei Nishikawa, Koki Shimizu.

Figure 1
Figure 1. Estimated MP distributions from W of a multilayer perceptron (MLP) trained on the MNIST dataset. The red and blue curves indicate the MP distributions estimated by BEMA and Gaussian broadening, respectively, with the corresponding vertical lines representing the noise–information boundaries. Axes: singular value (x) and probability density (y).
Figure 2
Figure 2. Metric Avew(ϕ̂) and test accuracy with respect to the estimated number of signal singular values (ŝ). Green circles (left y-axis) show Avew(ϕ̂), and purple squares (right y-axis) show test accuracy obtained after keeping the top ŝ singular values (others set to zero). Red and blue vertical lines indicate thresholds estimated by BEMA and Gaussian broadening, respectively.
Figure 3
Figure 3. Singular value distribution of the FC1 weight matrix in the MLP for different batch sizes. The dashed line represents the MP distribution estimated by BEMA, whereas the solid line indicates the threshold used to determine the number of singular values, ŝ, considered to represent the signal.
Abstract

This study evaluates thresholds for removing singular values from singular value decomposition-based low-rank approximations of deep neural network weight matrices. Each weight matrix is modeled as the sum of signal and noise matrices. The low-rank approximation is obtained by removing noise-related singular values using a threshold based on random matrix theory. To assess the adequacy of this threshold, we propose an evaluation metric based on the cosine similarity between the singular vectors of the signal and original weight matrices. The proposed metric is used in numerical experiments to compare two threshold estimation methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper models DNN weight matrices as the sum of a low-rank signal matrix and a noise matrix whose singular values follow random matrix theory (RMT) predictions such as the Marchenko-Pastur law. It obtains low-rank approximations by thresholding singular values using RMT-derived thresholds, and proposes a cosine similarity metric between the singular vectors of the resulting signal matrix and the original weight matrix as a way to evaluate threshold adequacy. Numerical experiments are used to compare two threshold estimation methods.

Significance. If the RMT modeling assumption holds for real DNN weights, the cosine-similarity diagnostic could provide a practical, ground-truth-free method for validating noise-removal thresholds in model compression pipelines. The work would then offer a concrete tool for practitioners choosing between RMT-based pruning heuristics. However, the significance is substantially reduced by the absence of any verification that DNN weight matrices exhibit the required RMT bulk spectrum.

major comments (2)
  1. [Abstract / Modeling] Abstract and modeling section: The claim that the cosine similarity metric can assess RMT threshold adequacy rests entirely on the untested premise that each weight matrix W equals a low-rank signal S plus noise N whose singular-value bulk obeys Marchenko-Pastur statistics. No empirical spectral density plots, Kolmogorov-Smirnov tests, or other checks against the RMT bulk are reported for the actual DNN matrices studied; without this, both the thresholds and the proposed metric lack a justified foundation.
  2. [Experiments] Experimental results section: The manuscript states that numerical experiments compare two threshold methods using the cosine metric, yet provides no information on the DNN architectures, layer dimensions, training datasets, or number of weight matrices examined. This omission prevents any assessment of whether the reported differences between methods are statistically meaningful or merely artifacts of particular matrices.
minor comments (1)
  1. [Evaluation Metric] Notation for the cosine similarity is introduced without an explicit equation number; adding a numbered definition would improve traceability when the metric is later used in figures or tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements, which will strengthen the empirical foundation and reproducibility of the work.

  1. Referee: [Abstract / Modeling] Abstract and modeling section: The claim that the cosine similarity metric can assess RMT threshold adequacy rests entirely on the untested premise that each weight matrix W equals a low-rank signal S plus noise N whose singular-value bulk obeys Marchenko-Pastur statistics. No empirical spectral density plots, Kolmogorov-Smirnov tests, or other checks against the RMT bulk are reported for the actual DNN matrices studied; without this, both the thresholds and the proposed metric lack a justified foundation.

    Authors: We agree that direct empirical verification of the Marchenko-Pastur bulk is essential to justify the modeling assumption and the proposed metric. While the low-rank-plus-noise decomposition draws on prior RMT applications to neural networks, the current manuscript does not include such checks for the specific matrices examined. In the revised version we will add (i) overlaid plots of the empirical singular-value density against the Marchenko-Pastur prediction for representative layers and (ii) Kolmogorov-Smirnov goodness-of-fit tests on the bulk eigenvalues, thereby providing the missing foundation for both the thresholds and the cosine-similarity diagnostic. revision: yes

  2. Referee: [Experiments] Experimental results section: The manuscript states that numerical experiments compare two threshold methods using the cosine metric, yet provides no information on the DNN architectures, layer dimensions, training datasets, or number of weight matrices examined. This omission prevents any assessment of whether the reported differences between methods are statistically meaningful or merely artifacts of particular matrices.

    Authors: We apologize for the lack of experimental detail. The revised manuscript will explicitly list the DNN architectures (e.g., ResNet-50, VGG-16), the exact dimensions of each weight matrix analyzed, the training datasets (CIFAR-10/100 and ImageNet), and the total number of matrices per experiment. We will also report standard errors or p-values for the cosine-similarity differences to allow readers to judge statistical significance. revision: yes

Circularity Check

0 steps flagged

No circularity: metric defined independently of RMT thresholds

The paper models each weight matrix as signal plus noise, applies an RMT-derived threshold to obtain a low-rank approximation, and then defines a separate cosine-similarity metric between the singular vectors of the resulting signal matrix and the original weight matrix to evaluate threshold adequacy. No equation or step reduces the metric to the threshold by construction, no parameter is fitted on a subset and then relabeled as a prediction, and no load-bearing claim rests on a self-citation chain. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central modeling assumption that weight matrices decompose into signal plus RMT noise is taken as given without further justification in the abstract.

axioms (1)
  • domain assumption: Each weight matrix is the sum of a low-rank signal matrix and a noise matrix whose singular values follow random matrix theory distributions.
    Explicitly stated in the abstract as the modeling basis for threshold selection.

pith-pipeline@v0.9.0 · 5380 in / 1031 out tokens · 78616 ms · 2026-05-16T22:40:13.221391+00:00 · methodology
