Evaluating Singular Value Thresholds for DNN Weight Matrices based on Random Matrix Theory

Hiroki Hashiguchi; Kohei Nishikawa; Koki Shimizu

arxiv: 2512.12911 · v2 · submitted 2025-12-15 · 📊 stat.ML · cs.LG

Evaluating Singular Value Thresholds for DNN Weight Matrices based on Random Matrix Theory

Kohei Nishikawa , Koki Shimizu , Hiroki Hashiguchi This is my paper

Pith reviewed 2026-05-16 22:40 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords singular value decompositionrandom matrix theorydeep neural network weightslow-rank approximationcosine similaritythreshold selectionnoise removal

0 comments

The pith

A cosine similarity metric between singular vectors can check whether random matrix theory thresholds correctly separate signal from noise in DNN weight matrices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats each DNN weight matrix as the sum of a low-rank signal part and a noise part whose singular values follow random matrix theory. It removes the noise singular values using an RMT-derived threshold to produce a low-rank approximation. To decide whether the chosen threshold is good, the authors introduce a metric that measures the cosine similarity between the leading singular vectors of the recovered signal and the original matrix. Numerical tests on real networks then compare two different ways of estimating the threshold with this new similarity score.

Core claim

The paper shows that the cosine similarity between the singular vectors of the estimated signal matrix and the original weight matrix serves as a practical indicator of whether an RMT-based threshold has removed the right singular values, allowing direct comparison of threshold selection methods on actual DNN weights.

What carries the argument

The cosine similarity between singular vectors of the signal and original weight matrices, which quantifies how well an RMT threshold preserves the dominant directions of the weight matrix.

If this is right

Thresholds chosen by the better of the two RMT estimators will produce low-rank weight approximations whose dominant singular vectors remain aligned with the original matrix.
The same similarity check can be applied to decide how many singular values to keep in any SVD-based compression of a trained network.
Networks whose weight matrices pass the similarity test after thresholding are expected to retain more of their original functional behavior than those that do not.
The metric supplies an objective score for ranking different RMT threshold formulas without requiring retraining or validation accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The similarity test could be used at training time to decide when to switch from full-rank to low-rank layers.
It may generalize to other matrix factorizations such as non-negative matrix factorization when the goal is to keep the dominant directions intact.
Repeated application across layers could reveal which layers in a network are most sensitive to singular-value truncation.

Load-bearing premise

DNN weight matrices can be modeled as the sum of a low-rank signal matrix and additive noise whose singular value distribution matches random matrix theory predictions.

What would settle it

Run the proposed cosine similarity test on a trained DNN weight matrix; if the similarity stays low even after applying the RMT threshold while the singular value histogram clearly deviates from the Marchenko-Pastur law, the evaluation approach fails.

Figures

Figures reproduced from arXiv: 2512.12911 by Hiroki Hashiguchi, Kohei Nishikawa, Koki Shimizu.

**Figure 1.** Figure 1: shows the estimated MP distributions from W of a multilayer perceptron (MLP) trained on MNIST dataset. The red and blue curves indicate the MP distributions estimated by BEMA and Gaussian broadening, respectively, with the corresponding vertical lines representing the noise–information boundaries. 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Singular value 0.5 1.0 1.5 2.0 2.5 3.0 3.5 Probability density [PITH_FULL… view at source ↗

**Figure 2.** Figure 2: Metric Avew(ϕˆ) and test accuracy with respect to the estimated number of signal singular values (ˆs). Green circles (left y-axis) show Avew(ϕˆ), and purple squares (right y-axis) show test accuracy obtained after keeping the top ˆs singular values (others set to zero). Red and blue vertical lines indicate thresholds estimated by BEMA and Gaussian broadening, respectively. 6 [PITH_FULL_IMAGE:figures/full_… view at source ↗

**Figure 3.** Figure 3: Singular value distribution of the FC1 weight matrix in the MLP for different batch sizes. The dashed line represents the MP distribution estimated by BEMA, whereas the solid line indicates the threshold used to determine the number of singular values, ˆs, considered to represent the signal. batch size of 256, and the largest singular values are substantially larger than the others. A batch size of 256 lea… view at source ↗

read the original abstract

This study evaluates thresholds for removing singular values from singular value decomposition-based low-rank approximations of deep neural network weight matrices. Each weight matrix is modeled as the sum of signal and noise matrices. The low-rank approximation is obtained by removing noise-related singular values using a threshold based on random matrix theory. To assess the adequacy of this threshold, we propose an evaluation metric based on the cosine similarity between the singular vectors of the signal and original weight matrices. The proposed metric is used in numerical experiments to compare two threshold estimation methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The cosine similarity metric for checking RMT thresholds on DNN weights is a practical addition, but the signal-plus-noise model rests on an unverified assumption.

read the letter

The paper introduces a cosine similarity score between the singular vectors kept after thresholding and those of the original weight matrix. This score serves as a diagnostic for whether an RMT-derived cutoff has removed the right singular values when approximating DNN weights as low-rank signal plus noise. They compare two standard threshold estimators on this basis in numerical experiments. The metric itself is new in this setting and gives a simple, label-free number that can be computed directly from the matrices. That is the concrete step forward. The experiments appear to run the comparison on actual weight matrices, which at least lets readers see numerical differences between the two estimators. The central modeling choice is that each weight matrix equals a low-rank signal plus a noise term whose singular values follow a random matrix law such as Marchenko-Pastur. Trained DNN weights are the output of gradient descent on highly structured data, so their empirical spectral distribution need not match the i.i.d. noise case that RMT assumes. The abstract and description give no indication that the authors checked how closely the observed bulk matches the theoretical prediction or how sensitive the cosine score is to departures from it. If the match is poor, both the threshold and the proposed diagnostic lose their justification. This work is aimed at people already using SVD for model compression who want a data-driven way to pick the cutoff. A reader who knows the RMT literature can extract the metric idea and test it themselves even if they remain skeptical of the noise model. I would send it to peer review so referees can examine the experimental details and press on the spectral assumption.

Referee Report

2 major / 1 minor

Summary. The paper models DNN weight matrices as the sum of a low-rank signal matrix and a noise matrix whose singular values follow random matrix theory (RMT) predictions such as the Marchenko-Pastur law. It obtains low-rank approximations by thresholding singular values using RMT-derived thresholds, and proposes a cosine similarity metric between the singular vectors of the resulting signal matrix and the original weight matrix as a way to evaluate threshold adequacy. Numerical experiments are used to compare two threshold estimation methods.

Significance. If the RMT modeling assumption holds for real DNN weights, the cosine-similarity diagnostic could provide a practical, ground-truth-free method for validating noise-removal thresholds in model compression pipelines. The work would then offer a concrete tool for practitioners choosing between RMT-based pruning heuristics. However, the significance is substantially reduced by the absence of any verification that DNN weight matrices exhibit the required RMT bulk spectrum.

major comments (2)

[Abstract / Modeling] Abstract and modeling section: The claim that the cosine similarity metric can assess RMT threshold adequacy rests entirely on the untested premise that each weight matrix W equals a low-rank signal S plus noise N whose singular-value bulk obeys Marchenko-Pastur statistics. No empirical spectral density plots, Kolmogorov-Smirnov tests, or other checks against the RMT bulk are reported for the actual DNN matrices studied; without this, both the thresholds and the proposed metric lack a justified foundation.
[Experiments] Experimental results section: The manuscript states that numerical experiments compare two threshold methods using the cosine metric, yet provides no information on the DNN architectures, layer dimensions, training datasets, or number of weight matrices examined. This omission prevents any assessment of whether the reported differences between methods are statistically meaningful or merely artifacts of particular matrices.

minor comments (1)

[Evaluation Metric] Notation for the cosine similarity is introduced without an explicit equation number; adding a numbered definition would improve traceability when the metric is later used in figures or tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements, which will strengthen the empirical foundation and reproducibility of the work.

read point-by-point responses

Referee: [Abstract / Modeling] Abstract and modeling section: The claim that the cosine similarity metric can assess RMT threshold adequacy rests entirely on the untested premise that each weight matrix W equals a low-rank signal S plus noise N whose singular-value bulk obeys Marchenko-Pastur statistics. No empirical spectral density plots, Kolmogorov-Smirnov tests, or other checks against the RMT bulk are reported for the actual DNN matrices studied; without this, both the thresholds and the proposed metric lack a justified foundation.

Authors: We agree that direct empirical verification of the Marchenko-Pastur bulk is essential to justify the modeling assumption and the proposed metric. While the low-rank-plus-noise decomposition draws on prior RMT applications to neural networks, the current manuscript does not include such checks for the specific matrices examined. In the revised version we will add (i) overlaid plots of the empirical singular-value density against the Marchenko-Pastur prediction for representative layers and (ii) Kolmogorov-Smirnov goodness-of-fit tests on the bulk eigenvalues, thereby providing the missing foundation for both the thresholds and the cosine-similarity diagnostic. revision: yes
Referee: [Experiments] Experimental results section: The manuscript states that numerical experiments compare two threshold methods using the cosine metric, yet provides no information on the DNN architectures, layer dimensions, training datasets, or number of weight matrices examined. This omission prevents any assessment of whether the reported differences between methods are statistically meaningful or merely artifacts of particular matrices.

Authors: We apologize for the lack of experimental detail. The revised manuscript will explicitly list the DNN architectures (e.g., ResNet-50, VGG-16), the exact dimensions of each weight matrix analyzed, the training datasets (CIFAR-10/100 and ImageNet), and the total number of matrices per experiment. We will also report standard errors or p-values for the cosine-similarity differences to allow readers to judge statistical significance. revision: yes

Circularity Check

0 steps flagged

No circularity: metric defined independently of RMT thresholds

full rationale

The paper models each weight matrix as signal plus noise, applies an RMT-derived threshold to obtain a low-rank approximation, and then defines a separate cosine-similarity metric between the singular vectors of the resulting signal matrix and the original weight matrix to evaluate threshold adequacy. No equation or step reduces the metric to the threshold by construction, no parameter is fitted on a subset and then relabeled as a prediction, and no load-bearing claim rests on a self-citation chain. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central modeling assumption that weight matrices decompose into signal plus RMT noise is taken as given without further justification in the abstract.

axioms (1)

domain assumption Each weight matrix is the sum of a low-rank signal matrix and a noise matrix whose singular values follow random matrix theory distributions.
Explicitly stated in the abstract as the modeling basis for threshold selection.

pith-pipeline@v0.9.0 · 5380 in / 1031 out tokens · 78616 ms · 2026-05-16T22:40:13.221391+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Each weight matrix is modeled as the sum of signal and noise matrices... singular values of W_noise are known to follow the MP distribution
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ϕ_i = |⟨ũ_i, u_i⟩|² a.s. → ... (Benaych-Georges and Nadakuditi, 2012)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1]

Zhang, S

C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning (still) requires rethinking generalization, Communications of the ACM 64 (2021) 107–115

work page 2021
[2]

Arpit, S

D. Arpit, S. Jastrz˛ ebski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y . Bengio, et al., A closer look at memorization in deep networks, in: International conference on machine learning, PMLR, 2017, pp. 233–242

work page 2017
[3]

Krogh, J

A. Krogh, J. Hertz, A simple weight decay can improve generalization, Advances in neural information processing systems 4 (1991) 950–957

work page 1991
[4]

Srivastava, G

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958

work page 2014
[5]

S. Han, J. Pool, J. Tran, W. J. Dally, Learning both weights and connections for efficient neural network, in: Advances in neural information processing systems, 2015, pp. 1135–1143

work page 2015
[6]

X. Lu, S. Matsuda, T. Shimizu, S. Nakamura, Noise reduction based random matrix theory, in: Proceedings of the 6th International Symposium on Chinese Spoken Language Processing, 2008, pp. 1–4

work page 2008
[7]

Aparicio, M

L. Aparicio, M. Bordyuh, A. J. Blumberg, R. Rabadan, A random matrix theory approach to denoise single-cell data, Patterns 1 (2020)

work page 2020
[8]

Plerou, P

V . Plerou, P. Gopikrishnan, B. Rosenow, L. A. N. Amaral, T. Guhr, H. E. Stanley, Random matrix approach to cross correlations in financial data, Physical Review E 65 (2002) 066126

work page 2002
[9]

Thamm, M

M. Thamm, M. Staats, B. Rosenow, Random matrix analysis of deep neural network weight matrices, Physical Review E 106 (2022) 054124

work page 2022
[10]

C. H. Martin, M. W. Mahoney, Implicit self-regularization in deep neural networks: evidence from random matrix theory and implications for learning, Journal of Machine Learning Research 22 (2021) 1–73. 10

work page 2021
[11]

C. H. Martin, T. Peng, M. W. Mahoney, Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data, Nature Communications 12 (2021) 1–13

work page 2021
[12]

X. Meng, J. Yao, Impact of classification difficulty on the weight matrices spectra in deep learning and application to early-stopping, Journal of Machine Learning Research 24 (2023) 1–40

work page 2023
[13]

N. P. Baskerville, D. Granziol, J. P. Keating, Appearance of random matrix theory in deep learning, Physica A: Statistical Mechanics and its Applications 590 (2022) 126742

work page 2022
[14]

H. K. Prakash, C. H. Martin, Grokking and generalization collapse: Insights from htsr theory, in: High-dimensional Learning Dynamics 2025, 2025

work page 2025
[15]

Staats, M

M. Staats, M. Thamm, B. Rosenow, Boundary between noise and information applied to filtering neural network weight matrices, Physical Review E 108 (2023) L022302

work page 2023
[16]

Berlyand, E

L. Berlyand, E. Sandier, Y . Shmalo, L. Zhang, Enhancing accuracy in deep learning using random matrix theory, Journal of Machine Learning 3 (2024) 347–412

work page 2024
[17]

Benaych-Georges, R

F. Benaych-Georges, R. R. Nadakuditi, The singular values and vectors of low rank perturbations of large rectangular random matrices, Journal of Multivariate Analysis 111 (2012) 120–135

work page 2012
[18]

Z. T. Ke, Y . Ma, X. Lin, Estimation of the number of spiked eigenvalues in a covariance matrix by bulk eigenvalue matching analysis, Journal of the American Statistical Association 118 (2023) 374–392

work page 2023
[19]

Glorot, Y

X. Glorot, Y . Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256

work page 2010
[20]

Berlyand, E

L. Berlyand, E. Sandier, Y . Shmalo, L. Zhang, Pruning deep neural networks via a combination of the marchenko- pastur distribution and regularization,https://arxiv.org/abs/2503.01922, 2025. ArXiv:2503.01922

work page arXiv 2025
[21]

Marcenko, L

A. Marcenko, L. A. Pastur, Distribution of eigenvalues for some sets of random matrices, Mathematics of the USSR-Sbornik 1 (1967) 457–483

work page 1967
[22]

I. M. Johnstone, On the distribution of the largest eigenvalue in principal components analysis, The Annals of Statistics 29 (2001) 295–327

work page 2001
[23]

Zhang, J

X. Zhang, J. Zou, K. He, J. Sun, Accelerating very deep convolutional networks for classification and detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2016) 1943–1955

work page 2016
[24]

LeCun, L

Y . LeCun, L. Bottou, Y . Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (2002) 2278–2324

work page 2002
[25]

Krizhevsky, I

A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, Com- munications of the ACM 60 (2017) 84–90. 11

work page 2017

[1] [1]

Zhang, S

C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning (still) requires rethinking generalization, Communications of the ACM 64 (2021) 107–115

work page 2021

[2] [2]

Arpit, S

D. Arpit, S. Jastrz˛ ebski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y . Bengio, et al., A closer look at memorization in deep networks, in: International conference on machine learning, PMLR, 2017, pp. 233–242

work page 2017

[3] [3]

Krogh, J

A. Krogh, J. Hertz, A simple weight decay can improve generalization, Advances in neural information processing systems 4 (1991) 950–957

work page 1991

[4] [4]

Srivastava, G

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958

work page 2014

[5] [5]

S. Han, J. Pool, J. Tran, W. J. Dally, Learning both weights and connections for efficient neural network, in: Advances in neural information processing systems, 2015, pp. 1135–1143

work page 2015

[6] [6]

X. Lu, S. Matsuda, T. Shimizu, S. Nakamura, Noise reduction based random matrix theory, in: Proceedings of the 6th International Symposium on Chinese Spoken Language Processing, 2008, pp. 1–4

work page 2008

[7] [7]

Aparicio, M

L. Aparicio, M. Bordyuh, A. J. Blumberg, R. Rabadan, A random matrix theory approach to denoise single-cell data, Patterns 1 (2020)

work page 2020

[8] [8]

Plerou, P

V . Plerou, P. Gopikrishnan, B. Rosenow, L. A. N. Amaral, T. Guhr, H. E. Stanley, Random matrix approach to cross correlations in financial data, Physical Review E 65 (2002) 066126

work page 2002

[9] [9]

Thamm, M

M. Thamm, M. Staats, B. Rosenow, Random matrix analysis of deep neural network weight matrices, Physical Review E 106 (2022) 054124

work page 2022

[10] [10]

C. H. Martin, M. W. Mahoney, Implicit self-regularization in deep neural networks: evidence from random matrix theory and implications for learning, Journal of Machine Learning Research 22 (2021) 1–73. 10

work page 2021

[11] [11]

C. H. Martin, T. Peng, M. W. Mahoney, Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data, Nature Communications 12 (2021) 1–13

work page 2021

[12] [12]

X. Meng, J. Yao, Impact of classification difficulty on the weight matrices spectra in deep learning and application to early-stopping, Journal of Machine Learning Research 24 (2023) 1–40

work page 2023

[13] [13]

N. P. Baskerville, D. Granziol, J. P. Keating, Appearance of random matrix theory in deep learning, Physica A: Statistical Mechanics and its Applications 590 (2022) 126742

work page 2022

[14] [14]

H. K. Prakash, C. H. Martin, Grokking and generalization collapse: Insights from htsr theory, in: High-dimensional Learning Dynamics 2025, 2025

work page 2025

[15] [15]

Staats, M

M. Staats, M. Thamm, B. Rosenow, Boundary between noise and information applied to filtering neural network weight matrices, Physical Review E 108 (2023) L022302

work page 2023

[16] [16]

Berlyand, E

L. Berlyand, E. Sandier, Y . Shmalo, L. Zhang, Enhancing accuracy in deep learning using random matrix theory, Journal of Machine Learning 3 (2024) 347–412

work page 2024

[17] [17]

Benaych-Georges, R

F. Benaych-Georges, R. R. Nadakuditi, The singular values and vectors of low rank perturbations of large rectangular random matrices, Journal of Multivariate Analysis 111 (2012) 120–135

work page 2012

[18] [18]

Z. T. Ke, Y . Ma, X. Lin, Estimation of the number of spiked eigenvalues in a covariance matrix by bulk eigenvalue matching analysis, Journal of the American Statistical Association 118 (2023) 374–392

work page 2023

[19] [19]

Glorot, Y

X. Glorot, Y . Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256

work page 2010

[20] [20]

Berlyand, E

L. Berlyand, E. Sandier, Y . Shmalo, L. Zhang, Pruning deep neural networks via a combination of the marchenko- pastur distribution and regularization,https://arxiv.org/abs/2503.01922, 2025. ArXiv:2503.01922

work page arXiv 2025

[21] [21]

Marcenko, L

A. Marcenko, L. A. Pastur, Distribution of eigenvalues for some sets of random matrices, Mathematics of the USSR-Sbornik 1 (1967) 457–483

work page 1967

[22] [22]

I. M. Johnstone, On the distribution of the largest eigenvalue in principal components analysis, The Annals of Statistics 29 (2001) 295–327

work page 2001

[23] [23]

Zhang, J

X. Zhang, J. Zou, K. He, J. Sun, Accelerating very deep convolutional networks for classification and detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2016) 1943–1955

work page 2016

[24] [24]

LeCun, L

Y . LeCun, L. Bottou, Y . Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (2002) 2278–2324

work page 2002

[25] [25]

Krizhevsky, I

A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, Com- munications of the ACM 60 (2017) 84–90. 11

work page 2017