Signal-to-Noise Ratio and Sample Size Govern Representational Alignment in Neural Networks

Alessandro Laio; Ali Hussaini Umar

arXiv: 2605.26973 · v1 · pith:COYRO74Jnew · submitted 2026-05-26 · stat.ML · cond-mat.dis-nn · cs.LG · cs.NE · q-bio.NC

Pith reviewed 2026-06-29 15:44 UTC · model grok-4.3

classification: stat.ML · cond-mat.dis-nn · cs.LG · cs.NE · q-bio.NC

keywords: representational alignment, signal-to-noise ratio, sample size, interpolation threshold, neural networks, generalization error, linear network

The pith

Representational alignment in neural networks increases monotonically with signal-to-noise ratio but dips to a minimum near the interpolation threshold as sample size grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how similarly different neural networks represent data when each is trained on a noisy version of the same task. It perturbs training sets with independent noise realizations and measures alignment across ensembles of networks on both regression and classification problems. Alignment rises steadily as the signal-to-noise ratio improves. Alignment varies non-monotonically with the number of training examples, reaching its lowest point near the sample size at which the network can perfectly fit the training data. The same pattern appears in an exactly solvable linear network and in nonlinear networks on real data, and stronger alignment does not necessarily correspond to lower generalization error.

Core claim

Across linear and nonlinear networks, regression and classification tasks, and both synthetic and real-world data, we consistently observe that alignment varies monotonically with SNR but non-monotonically with training sample size. In particular, the alignment is minimized near the interpolation threshold, and a stronger alignment does not necessarily correspond to better generalization error.

What carries the argument

Training ensembles on independently noise-perturbed versions of the same dataset, with analytic alignment computation in a single-hidden-layer linear network.
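To make the ensemble construction concrete, here is a minimal sketch of how one might build independently perturbed copies of a single dataset at a target SNR. It assumes additive Gaussian noise on regression targets and defines SNR as the ratio of signal variance to noise variance; both are illustrative assumptions, and the paper's exact noise process may differ. The helper names `perturb_targets` and `make_ensemble_datasets` are hypothetical.

```python
import numpy as np

def perturb_targets(y_clean, snr, rng):
    """Add Gaussian noise so that var(signal) / var(noise) equals snr.

    Assumption: SNR is a variance ratio and the noise is additive on targets.
    """
    noise_std = np.sqrt(np.var(y_clean) / snr)
    return y_clean + rng.normal(0.0, noise_std, size=y_clean.shape)

def make_ensemble_datasets(X, y_clean, snr, n_members, seed=0):
    """Return one (X, y_noisy) pair per ensemble member.

    Every member sees the same inputs but an independent noise realization.
    """
    rng = np.random.default_rng(seed)
    return [(X, perturb_targets(y_clean, snr, rng)) for _ in range(n_members)]

# Synthetic linear task used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
w = rng.normal(size=20)
y_clean = X @ w

datasets = make_ensemble_datasets(X, y_clean, snr=5.0, n_members=4)
```

Each network in the ensemble would then be trained on one element of `datasets`, so any difference between the trained networks traces back to the noise realization (and, if not fixed, the initialization).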

If this is right

Alignment increases as the signal-to-noise ratio of the training data increases.
Alignment reaches a minimum near the interpolation threshold when sample size is varied.
Stronger alignment does not imply lower generalization error.
The same dependence on SNR and sample size appears in both the linear analytic model and nonlinear networks on real data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Data collection strategies could target sample sizes away from the interpolation threshold if high alignment is needed for a downstream task.
The observed decoupling suggests alignment and generalization are controlled by distinct aspects of the training dynamics.
Non-monotonic dependence on sample size may appear in other similarity measures between networks.

Load-bearing premise

The specific noise process and alignment metric make the linear network's behavior representative of nonlinear networks trained on real data.

What would settle it

Measure alignment in networks trained at multiple sample sizes around the interpolation threshold and find that it does not reach a minimum there.
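A toy version of this check can be sketched with minimum-norm linear regression, where the interpolation threshold sits at n = p (sample size equal to the number of parameters). The sketch below is not the paper's experiment: it swaps in the cosine similarity between two learned weight vectors as a simple stand-in for a representational alignment measure, and it assumes Gaussian inputs and additive Gaussian target noise. The function names are hypothetical.

```python
import numpy as np

def fit_min_norm(X, y):
    # Minimum-norm least-squares solution via the pseudoinverse;
    # it interpolates the training targets whenever n <= p.
    return np.linalg.pinv(X) @ y

def mean_weight_cosine(n, p, snr, n_trials, rng):
    """Average cosine similarity between two fits that share inputs
    but see independent noise realizations (a stand-in alignment proxy)."""
    sims = []
    for _ in range(n_trials):
        w_true = rng.normal(size=p)
        X = rng.normal(size=(n, p))
        y_clean = X @ w_true
        # Population signal variance is ||w_true||^2 for unit-variance inputs.
        noise_std = np.sqrt(np.sum(w_true ** 2) / snr)
        w_a = fit_min_norm(X, y_clean + rng.normal(0.0, noise_std, size=n))
        w_b = fit_min_norm(X, y_clean + rng.normal(0.0, noise_std, size=n))
        sims.append(w_a @ w_b / (np.linalg.norm(w_a) * np.linalg.norm(w_b)))
    return float(np.mean(sims))

p = 50
sample_sizes = [10, 25, 40, 50, 60, 100, 200]
rng = np.random.default_rng(0)
alignment = {
    n: mean_weight_cosine(n, p, snr=5.0, n_trials=20, rng=rng)
    for n in sample_sizes
}
n_at_min = min(alignment, key=alignment.get)
```

In this toy setting the noise-driven part of the solution is amplified most strongly when the design matrix is square, so the proxy is expected to be lowest at `n == p`. Whether the same holds for a given network and alignment metric is exactly what the proposed measurement would test.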

Figures

Figures reproduced from arXiv: 2605.26973 by Alessandro Laio, Ali Hussaini Umar.

**Figure 1.** Representational alignment depends on the signal-to-noise ratio (SNR) and increases non-monotonically with training sample size (n). Panel (a) presents the generalization error of a two-layer linear network as a function of training sample size divided by the network’s parameters for different SNR (at global minimum). The solid lines correspond to the theoretical predictions, while the dots with bars deno…

**Figure 2.** Empirical results for non-linear neural network. Panel (a) presents the average generalization error of a two-layer network with ReLU activation as a function of training sample size divided by the network’s parameters for different SNR values. Panel (b) presents the average Conditional Copula Entropy between hidden representations of identical networks trained on independent datasets. Numerical results…

**Figure 3.** Alignment between representations of linear and nonlinear networks. Panel (a) presents the average generalization error of a two-layer network with linear (orange curve) and ReLU (blue curve) activation as a function of training sample size divided by the network’s parameters for SNR = 5. Panel (b) presents the average Conditional Copula Entropy between hidden representations of networks with Linear and Re…

**Figure 4.** Representational alignment of independently trained classification networks. Panel (a) presents the generalization error of the trained networks for various label noise. Panel (b) presents the CCE between penultimate representations of identical networks trained on independent datasets, starting from the same initialization. All curves are plotted as a function of the ratio between the training sample size…

**Figure 5.** Representational alignment of independently trained classification networks. Panel (a) presents the generalization error of the trained networks for various label noise. Panel (b) presents the CCE between penultimate representations of identical networks trained on independent datasets, starting from the same initialization. All curves are plotted as a function of the ratio between the training sample size…

Original abstract

Neural networks are known to develop latent representations that are aligned, namely structurally similar across networks trained with different architectures, training protocols, or training datasets. We study this phenomenon in a controlled setting, where we train an ensemble of networks on regression and classification tasks using training sets perturbed by independent realizations of a noise process. We show that the signal-to-noise ratio (SNR) and the training sample size influence the alignment in qualitatively similar ways in networks trained on real-world datasets and in an extremely simple linear network with a single hidden layer, for which the alignment can be estimated analytically. Across linear and nonlinear networks, regression and classification tasks, and both synthetic and real-world data, we consistently observe that alignment varies monotonically with SNR but non-monotonically with training sample size. In particular, the alignment is minimized near the interpolation threshold, and a stronger alignment does not necessarily correspond to better generalization error. These findings reveal a non-trivial dependence of alignment on data quality and quantity, decoupled from generalization performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core finding is that alignment between network representations rises monotonically with SNR but dips to a minimum near the interpolation threshold as sample size grows, and this pattern holds from an analytically solvable linear case through to nonlinear nets on real data while staying separate from generalization.

The main thing to know is that alignment rises monotonically with SNR but depends non-monotonically on sample size, bottoming out around interpolation, and the authors back this with both an exact linear calculation and matching behavior in nonlinear networks.

What the work actually adds is the explicit non-monotonic sample-size dependence and the clean decoupling from test error. The linear single-hidden-layer case is solved analytically, which gives an independent check rather than just curve-fitting. They then show the same qualitative trends appear across regression/classification, synthetic/real data, and different architectures. That consistency is the strongest part; it goes beyond restating prior results on similarity measures.

The soft spots are mostly around the modeling choices. The alignment metric and the specific additive noise process are defined by the authors, and while the stress-test note says they ran sensitivity checks, those choices still determine how far the claim travels beyond the controlled setting. The linear-to-nonlinear jump is presented as qualitative similarity rather than a tight quantitative match, so readers will want to see how much the numbers actually line up. No internal contradictions jump out from the abstract or the stress-test summary.

This paper is for people who care about representational similarity, dataset curation, or the interpolation regime. A reader already working on alignment metrics or double-descent phenomena will find the separation from generalization useful. It is grounded enough, with the analytic piece and the cross-setting replication, that it should go to referees rather than get desk-rejected.

Referee Report

0 major / 4 minor

Summary. The paper studies representational alignment across an ensemble of neural networks trained on independently noise-perturbed copies of the same dataset. It claims that alignment increases monotonically with signal-to-noise ratio (SNR) while varying non-monotonically with training sample size (minimum near the interpolation threshold), that these patterns hold for both an analytically solvable linear network and nonlinear networks on synthetic and real data, and that alignment is decoupled from generalization error. The linear-network derivation supplies an independent check on the empirical observations.

Significance. If the central claims hold, the work supplies a concrete, analytically grounded account of when and why representations align across models. The closed-form linear-network result is a clear strength, as is the consistency of the SNR and sample-size patterns across regression/classification, synthetic/real data, and linear/nonlinear regimes. The decoupling from generalization performance is a useful negative result for interpretability and ensembling research.

minor comments (4)

§3.2, Eq. (8): the alignment metric is defined via centered kernel alignment; a one-sentence reminder of its invariance properties would help readers connect it to the linear closed form.
Figure 4 caption: the interpolation threshold is marked but the precise definition (e.g., number of parameters vs. effective degrees of freedom) is not restated; readers must hunt in §4.1.
§5.3: the sensitivity analysis to noise-process parameters is present but reported only for two values of σ; a brief table or inset showing the monotonicity slope across a wider grid would be helpful.
Table 2: the R² values for the linear-network fit are given, but the number of independent noise realizations used to compute each point is not stated in the table or caption.
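The first minor comment asks for a reminder of the invariance properties of centered kernel alignment. As a hedged illustration, the sketch below implements the linear variant of CKA in its feature-space form (as described by Kornblith et al., 2019) and checks numerically that it is unchanged by an orthogonal transformation and by isotropic rescaling of one representation. Whether this variant matches the metric used in the paper is not confirmed here; the figure captions refer to Conditional Copula Entropy. The function name `linear_cka` is an illustrative choice.

```python
import numpy as np

def linear_cka(A, B):
    """Linear CKA between two representations of the same n examples.

    A has shape (n, d1) and B has shape (n, d2); columns are centered first.
    """
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    cross = np.linalg.norm(A.T @ B, "fro") ** 2
    norm_a = np.linalg.norm(A.T @ A, "fro")
    norm_b = np.linalg.norm(B.T @ B, "fro")
    return cross / (norm_a * norm_b)

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 16))

# Random orthogonal matrix from a QR decomposition, plus an isotropic scale.
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
H_transformed = 3.0 * H @ Q

cka_same = linear_cka(H, H_transformed)                    # rotated and rescaled copy
cka_indep = linear_cka(H, rng.normal(size=(200, 16)))      # unrelated features
```

Under these assumptions `cka_same` equals 1 up to floating-point error, while `cka_indep` is much smaller, which is the property that lets a similarity score ignore arbitrary rotations of the hidden units.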

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the work and the recommendation of minor revision. The report does not list any specific major comments.

Circularity Check

0 steps flagged

No significant circularity identified

The paper's central results are observational comparisons: alignment is shown to vary monotonically with SNR and non-monotonically with sample size (minimum near interpolation) both in an analytically solvable linear network and in nonlinear networks on real data. The linear analytic estimate is derived independently from the model equations rather than fitted to the nonlinear results, and the alignment metric is defined explicitly without reducing to a parameter fit by construction. No self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the provided abstract or reader summary. The derivation chain is self-contained against external benchmarks (analytic solution + controlled experiments), yielding a normal non-circular outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only; no free parameters, invented entities, or non-standard axioms are mentioned. The work relies on standard definitions of alignment and training dynamics.

axioms (1)

standard math: Standard mathematical definitions of representational alignment and interpolation threshold apply to both linear and nonlinear networks.
Invoked implicitly when claiming qualitative similarity between the analytically solvable linear case and real networks.

pith-pipeline@v0.9.1-grok · 5720 in / 1235 out tokens · 31443 ms · 2026-06-29T15:44:14.318020+00:00 · methodology

Reference graph

Works this paper leans on

63 extracted references · 14 canonical work pages · 8 internal anchors

[1] Acevedo, S., Mascaretti, A., Rende, R., Mahaut, M., Baroni, M., and Laio, A. (2025). A quantitative analysis of semantic information in deep representations of text and images. arXiv preprint arXiv:2505.17101.
[2] Advani, M. S., Saxe, A. M., and Sompolinsky, H. (2020). High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446.
[3] Ansuini, A., Laio, A., Macke, J. H., and Zoccolan, D. (2019). Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32.
[4] Arora, S., Cohen, N., Hu, W., and Luo, Y. (2019). Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32.
[5] Atanasov, A., Bordelon, B., and Pehlevan, C. (2022). Neural networks as kernel learners: The silent alignment effect. In International Conference on Learning Representations.
[6] Bai, Z., Silverstein, J. W., et al. (2010). Spectral analysis of large dimensional random matrices, volume 20. Springer.
[7] Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58.
[8] Bansal, Y., Nakkiran, P., and Barak, B. (2021). Revisiting model stitching to compare neural representations. Advances in Neural Information Processing Systems, 34:225–236.
[9] Barbier, J., Camilli, F., Nguyen, M.-T., Pastore, M., and Skerk, R. (2025). Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation. arXiv preprint arXiv:2510.24616.
[10] Beirlant, J., Dudewicz, E. J., Györfi, L., Van der Meulen, E. C., et al. (1997). Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39.
[11] Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854.
[12] Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.
[13] Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recognition and machine learning, volume 4. Springer.
[14] Braun, L., Grant, E., and Saxe, A. M. (2025). Not all solutions are created equal: An analytical dissociation of functional and representational similarity in deep linear neural networks. In Forty-second International Conference on Machine Learning.
[15] Casella, G. and Berger, R. (2024). Statistical inference. Chapman and Hall/CRC.
[16] Cho, K., Van Merriënboer, B., Gulçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.
[17] Cui, H., Krzakala, F., and Zdeborová, L. (2023). Bayes-optimal learning of deep random networks of extensive-width. In International Conference on Machine Learning, pages 6468–6521. PMLR.
[18] Cui, H., Krzakala, F., and Zdeborová, L. (2025). Bayes-optimal learning of deep random networks of extensive-width. Journal of Statistical Mechanics: Theory and Experiment, 2025(1):014001.
[19] Del Tatto, V., Fortunato, G., Bueti, D., and Laio, A. (2024). Robust inference of causality in high-dimensional dynamical processes from the information imbalance of distance ranks. Proceedings of the National Academy of Sciences, 121(19):e2317256121.
[20] Dominé, C. C., Anguita, N., Proca, A. M., Braun, L., Kunin, D., Mediano, P. A., and Saxe, A. M. (2024). From lazy to rich: Exact learning dynamics in deep linear networks. arXiv preprint arXiv:2409.14623.
[21] Dominé, C. C. J., Anguita, N., Proca, A. M., Braun, L., Kunin, D., Mediano, P. A. M., and Saxe, A. M. (2025). From lazy to rich: Exact learning dynamics in deep linear networks. In The Thirteenth International Conference on Learning Representations.
[22] Glielmo, A., Zeni, C., Cheng, B., Csányi, G., and Laio, A. (2022). Ranking the information content of distance measures. PNAS Nexus, 1(2):pgac039.
[23] Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning, volume 1. MIT Press.
[24] Gröger, F., Wen, S., and Brbić, M. (2026). Revisiting the Platonic representation hypothesis: An Aristotelian view. arXiv preprint arXiv:2602.14486.
[25] Gu, Y., Zheng, X., and Aste, T. (2024). Unraveling the enigma of double descent: An in-depth analysis through the lens of learned feature space. In The Twelfth International Conference on Learning Representations.
[26] Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. Annals of Statistics, 50(2):949.
[27] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97.
[28] Huh, M., Cheung, B., Wang, T., and Isola, P. (2024). The Platonic representation hypothesis. arXiv preprint arXiv:2405.07987.
[29] Jarvis, D., Lee, S., Dominé, C. C. J., Saxe, A. M., and Sarao Mannelli, S. (2025). A theory of initialisation’s impact on specialisation. Journal of Statistical Mechanics: Theory and Experiment, 2025(11):114001.
[30] Kalimeris, D., Kaplun, G., Nakkiran, P., Edelman, B., Yang, T., Barak, B., and Zhang, H. (2019). SGD on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems, 32.
[31] Kang, H., Canatar, A., and Chung, S. (2025). Spectral analysis of representational similarity with limited neurons. arXiv preprint arXiv:2502.19648.
[32] Klabunde, M., Schumacher, T., Strohmaier, M., and Lemmerich, F. (2025). Similarity of neural network models: A survey of functional and representational measures. ACM Computing Surveys, 57(9):1–52.
[33] Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. (2019). Similarity of neural network representations revisited. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3519–3529. PMLR.
[34] Kraskov, A., Stögbauer, H., and Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6):066138.
[35] Kriegeskorte, N., Mur, M., and Bandettini, P. A. (2008). Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:249.
[36] Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.
[37] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
[38] Krogh, A. and Hertz, J. A. (1992). Generalization in a linear perceptron in the presence of noise. Journal of Physics A: Mathematical and General, 25(5):1135–1147.
[39] Lampinen, A. K., Chan, S. C., and Hermann, K. (2024). Learned feature representations are biased by complexity, learning order, position, and more. Transactions on Machine Learning Research.
[40] Lampinen, A. K., Chan, S. C., Li, Y., and Hermann, K. (2025). Representation biases: will we achieve complete understanding by analyzing representations? arXiv preprint arXiv:2507.22216.
[41] Lampinen, A. K. and Ganguli, S. (2018). An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv preprint arXiv:1809.10374.
[42] LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
[43] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
[44] Li, Y., Yosinski, J., Clune, J., Lipson, H., and Hopcroft, J. (2015). Convergent learning: Do different neural networks learn the same representations? arXiv preprint arXiv:1511.07543.
[45] Marčenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483.
[46] Mei, S. and Montanari, A. (2022). The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766.
[47] Mendes, V. C., Bardone, L., Koller, C., Moreira, J. M., Erba, V., Troiani, E., and Zdeborová, L. (2026). A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization. arXiv preprint arXiv:2602.10680.
[48] Murphy, A., Zylberberg, J., and Fyshe, A. (2024). Correcting biased centered kernel alignment measures in biological and artificial neural networks. arXiv preprint arXiv:2405.01012.
[49] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2021). Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003.
[50] Nelsen, R. B. (2006). An introduction to copulas. Springer.
[51] Northcutt, C. G., Athalye, A., and Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749.
[52] Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. (2017). SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems, 30.
[53] Refinetti, M., Ingrosso, A., and Goldt, S. (2023). Neural networks trained with SGD learn distributions of increasing complexity. In International Conference on Machine Learning, pages 28843–28863. PMLR.
[54] Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
[55] Saxe, A. M., McClelland, J. L., and Ganguli, S. (2019). A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537–11546.
[56] Sucholutsky, I., Muttenthaler, L., Weller, A., Peng, A., Bobu, A., Kim, B., Love, B. C., Cueva, C. J., Grant, E., Groen, I., Achterberg, J., Tenenbaum, J. B., Collins, K. M., Hermann, K., Oktar, K., Greff, K., Hebart, M. N., Cloos, N., Kriegeskorte, N., Jacoby, N., Zhang, Q., Marjieh, R., Geirhos, R., Chen, S., Kornblith, S., Rane, S., Konkle, T., O’Conne... (2025)
[57] Umar, A. H., Nando Tezoh, F. K., Barbier, J., Acevedo, S., and Laio, A. (2026). The effect of label noise on the information content of neural representations. Frontiers in Physics, 14:1717253.
[58] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[59] Williams, A. H., Kunz, E., Kornblith, S., and Linderman, S. (2021). Generalized shape metrics on neural representations. Advances in Neural Information Processing Systems, 34:4738–4750.
[60] Woodworth, B., Gunasekar, S., Lee, J. D., Moroshko, E., Savarese, P., Golan, I., Soudry, D., and Srebro, N. (2020). Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635–3673. PMLR.
[61] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

A Related Work. A.1 Measures of representational similarity/alignment. Quantifying when two information processing systems represent information in compatible ways is a long-standing questio...

[62] A single-layer FCNN with k hidden units, with input and output dimensions fixed to match the number of pixels in the input image and the number of classes in the task, respectively.
[63] A family of CNNs consisting of 4 convolutional stages of width [k, 2k, 4k, 8k], where k is the width parameter, followed by a fully connected layer as a classifier. The MaxPool is [2, 2, 2, 4]. For the entire convolution layer, the kernel size = 3, stride = 1, and padding = 1. This architecture is identical to the one considered in [25]. The networks are traine...

[1] [1]

Acevedo, S., Mascaretti, A., Rende, R., Mahaut, M., Baroni, M., and Laio, A. (2025). A quantitative analysis of semantic information in deep representations of text and images.arXiv preprint arXiv:2505.17101

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

S., Saxe, A

Advani, M. S., Saxe, A. M., and Sompolinsky, H. (2020). High-dimensional dynamics of generalization error in neural networks.Neural Networks, 132:428–446

2020

[3] [3]

H., and Zoccolan, D

Ansuini, A., Laio, A., Macke, J. H., and Zoccolan, D. (2019). Intrinsic dimension of data representations in deep neural networks.Advances in Neural Information Processing Systems, 32

2019

[4] [4]

Arora, S., Cohen, N., Hu, W., and Luo, Y . (2019). Implicit regularization in deep matrix factorization.Advances in neural information processing systems, 32

2019

[5] [5]

Atanasov, A., Bordelon, B., and Pehlevan, C. (2022). Neural networks as kernel learners: The silent alignment effect. InInternational Conference on Learning Representations

2022

[6] [6]

W., et al

Bai, Z., Silverstein, J. W., et al. (2010).Spectral analysis of large dimensional random matrices, volume 20. Springer

2010

[7] [7]

and Hornik, K

Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima.Neural networks, 2(1):53–58

1989

[8] [8]

Bansal, Y ., Nakkiran, P., and Barak, B. (2021). Revisiting model stitching to compare neural representations.Advances in neural information processing systems, 34:225–236

2021

[9] [9]

Barbier, J., Camilli, F., Nguyen, M.-T., Pastore, M., and Skerk, R. (2025). Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation.arXiv preprint arXiv:2510.24616

work page arXiv 2025

[10] [10]

J., Györfi, L., Van der Meulen, E

Beirlant, J., Dudewicz, E. J., Györfi, L., Van der Meulen, E. C., et al. (1997). Nonparametric entropy estimation: An overview.International Journal of Mathematical and Statistical Sciences, 6(1):17–39

1997

[11] [11]

Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854

2019

[12] [12]

Bengio, Y ., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828

2013

[13] [13]

Bishop, C. M. and Nasrabadi, N. M. (2006).Pattern recognition and machine learning, volume 4. Springer

2006

[14] [14]

Braun, L., Grant, E., and Saxe, A. M. (2025). Not all solutions are created equal: An analytical dissociation of functional and representational similarity in deep linear neural networks. In Forty-second International Conference on Machine Learning

2025

[15] [15]

and Berger, R

Casella, G. and Berger, R. (2024).Statistical inference. Chapman and Hall/CRC

2024

[16] [16]

Cho, K., Van Merriënboer, B., Gulçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y . (2014). Learning phrase representations using rnn encoder–decoder for statistical machine translation. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1724–1734. 10

2014

[17] [17]

Cui, H., Krzakala, F., and Zdeborova, L. (2023). Bayes-optimal learning of deep random networks of extensive-width. InInternational Conference on Machine Learning, pages 6468–6521. PMLR

2023

[18] [18]

Cui, H., Krzakala, F., and Zdeborová, L. (2025). Bayes-optimal learning of deep ran- dom networks of extensive-width.Journal of Statistical Mechanics: Theory and Experiment, 2025(1):014001

2025

[19] [19]

Del Tatto, V ., Fortunato, G., Bueti, D., and Laio, A. (2024). Robust inference of causality in high- dimensional dynamical processes from the information imbalance of distance ranks.Proceedings of the National Academy of Sciences, 121(19):e2317256121

2024

[20] [20]

C., Anguita, N., Proca, A

Dominé, C. C., Anguita, N., Proca, A. M., Braun, L., Kunin, D., Mediano, P. A., and Saxe, A. M. (2024). From lazy to rich: Exact learning dynamics in deep linear networks.arXiv preprint arXiv:2409.14623

work page arXiv 2024

[21] [21]

Dominé, C. C. J., Anguita, N., Proca, A. M., Braun, L., Kunin, D., Mediano, P. A. M., and Saxe, A. M. (2025). From lazy to rich: Exact learning dynamics in deep linear networks. InThe Thirteenth International Conference on Learning Representations

2025

[22]

Glielmo, A., Zeni, C., Cheng, B., Csányi, G., and Laio, A. (2022). Ranking the information content of distance measures. PNAS Nexus, 1(2):pgac039.

[23]

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning, volume 1. MIT Press.

[24]

Gröger, F., Wen, S., and Brbić, M. (2026). Revisiting the platonic representation hypothesis: An aristotelian view. arXiv preprint arXiv:2602.14486.

[25]

Gu, Y., Zheng, X., and Aste, T. (2024). Unraveling the enigma of double descent: An in-depth analysis through the lens of learned feature space. In The Twelfth International Conference on Learning Representations.

[26]

Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. Annals of Statistics, 50(2):949.

[27]

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97.

[28]

Huh, M., Cheung, B., Wang, T., and Isola, P. (2024). The platonic representation hypothesis. arXiv preprint arXiv:2405.07987.

[29]

Jarvis, D., Lee, S., Dominé, C. C. J., Saxe, A. M., and Sarao Mannelli, S. (2025). A theory of initialisation’s impact on specialisation. Journal of Statistical Mechanics: Theory and Experiment, 2025(11):114001.

[30]

Kalimeris, D., Kaplun, G., Nakkiran, P., Edelman, B., Yang, T., Barak, B., and Zhang, H. (2019). SGD on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems, 32.

[31]

Kang, H., Canatar, A., and Chung, S. (2025). Spectral analysis of representational similarity with limited neurons. arXiv preprint arXiv:2502.19648.

[32]

Klabunde, M., Schumacher, T., Strohmaier, M., and Lemmerich, F. (2025). Similarity of neural network models: A survey of functional and representational measures. ACM Computing Surveys, 57(9):1–52.

[33]

Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. (2019). Similarity of neural network representations revisited. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3519–3529. PMLR.

[34]

Kraskov, A., Stögbauer, H., and Grassberger, P. (2004). Estimating mutual information. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 69(6):066138.

[35]

Kriegeskorte, N., Mur, M., and Bandettini, P. A. (2008). Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:249.

[36]

Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.

[37]

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.

[38]

Krogh, A. and Hertz, J. A. (1992). Generalization in a linear perceptron in the presence of noise. Journal of Physics A: Mathematical and General, 25(5):1135–1147.

[39]

Lampinen, A. K., Chan, S. C., and Hermann, K. (2024). Learned feature representations are biased by complexity, learning order, position, and more. Transactions on Machine Learning Research.

[40]

Lampinen, A. K., Chan, S. C., Li, Y., and Hermann, K. (2025). Representation biases: will we achieve complete understanding by analyzing representations? arXiv preprint arXiv:2507.22216.

[41]

Lampinen, A. K. and Ganguli, S. (2018). An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv preprint arXiv:1809.10374.

[42]

LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/

[43]

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

[44]

Li, Y., Yosinski, J., Clune, J., Lipson, H., and Hopcroft, J. (2015). Convergent learning: Do different neural networks learn the same representations? arXiv preprint arXiv:1511.07543.

[45]

Marčenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483.

[46]

Mei, S. and Montanari, A. (2022). The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766.

[47]

Mendes, V. C., Bardone, L., Koller, C., Moreira, J. M., Erba, V., Troiani, E., and Zdeborová, L. (2026). A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization. arXiv preprint arXiv:2602.10680.

[48]

Murphy, A., Zylberberg, J., and Fyshe, A. (2024). Correcting biased centered kernel alignment measures in biological and artificial neural networks. arXiv preprint arXiv:2405.01012.

[49]

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2021). Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003.

[50]

Nelsen, R. B. (2006). An introduction to copulas. Springer.

[51]

Northcutt, C. G., Athalye, A., and Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749.

[52]

Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. (2017). SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems, 30.

[53]

Refinetti, M., Ingrosso, A., and Goldt, S. (2023). Neural networks trained with SGD learn distributions of increasing complexity. In International Conference on Machine Learning, pages 28843–28863. PMLR.

[54]

Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

[55]

Saxe, A. M., McClelland, J. L., and Ganguli, S. (2019). A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537–11546.

[56]

Sucholutsky, I., Muttenthaler, L., Weller, A., Peng, A., Bobu, A., Kim, B., Love, B. C., Cueva, C. J., Grant, E., Groen, I., Achterberg, J., Tenenbaum, J. B., Collins, K. M., Hermann, K., Oktar, K., Greff, K., Hebart, M. N., Cloos, N., Kriegeskorte, N., Jacoby, N., Zhang, Q., Marjieh, R., Geirhos, R., Chen, S., Kornblith, S., Rane, S., Konkle, T., O’Conne...

[57]

Umar, A. H., Nando Tezoh, F. K., Barbier, J., Acevedo, S., and Laio, A. (2026). The effect of label noise on the information content of neural representations. Frontiers in Physics, 14:1717253.

[58]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

[59]

Williams, A. H., Kunz, E., Kornblith, S., and Linderman, S. (2021). Generalized shape metrics on neural representations. Advances in Neural Information Processing Systems, 34:4738–4750.

[60]

Woodworth, B., Gunasekar, S., Lee, J. D., Moroshko, E., Savarese, P., Golan, I., Soudry, D., and Srebro, N. (2020). Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635–3673. PMLR.

[61]

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

A Related Work

A.1 Measures of representational similarity/alignment. Quantifying when two information processing systems represent information in compatible ways is a long-standing questio...

A single-layer FCNN with k hidden units, with input and output dimensions fixed to match the number of pixels in the input image and the number of classes in the task, respectively.

A family of CNNs consisting of 4 convolutional stages of width [k, 2k, 4k, 8k], where k is the width parameter, followed by a fully connected layer as a classifier. The MaxPool sizes are [2, 2, 2, 4]. For all convolutional layers, the kernel size is 3, the stride is 1, and the padding is 1. This architecture is identical to the one considered in [25]. The networks are traine...
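As a rough check of the CNN description above, the sketch below traces the feature-map shape through the four stages. It is illustrative only and rests on assumptions not stated in the text: a 32x32 input (CIFAR-like), one convolution per stage, and non-overlapping max pooling applied after each stage with the pooling stride equal to its size. The function names `conv_out` and `cnn_shapes` are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a square convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def cnn_shapes(k, input_size=32, pools=(2, 2, 2, 4)):
    """Return [(channels, spatial size)] per stage and the classifier input width.

    Assumes (not stated in the text): square inputs, one conv per stage,
    and pooling with stride equal to the pooling size.
    """
    widths = [k, 2 * k, 4 * k, 8 * k]
    size = input_size
    shapes = []
    for width, pool in zip(widths, pools):
        size = conv_out(size)  # kernel 3, stride 1, padding 1 keeps the size
        size = size // pool    # non-overlapping max pooling shrinks it
        shapes.append((width, size))
    channels, final_size = shapes[-1]
    flat = channels * final_size ** 2  # input width of the linear classifier
    return shapes, flat


shapes, flat = cnn_shapes(k=64)
print(shapes)  # [(64, 16), (128, 8), (256, 4), (512, 1)]
print(flat)    # 512
```

Under these assumptions the pooling sizes [2, 2, 2, 4] reduce a 32x32 input to a single spatial position, so the fully connected classifier would receive 8k features.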