Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing
Pith reviewed 2026-05-20 06:49 UTC · model grok-4.3
The pith
The optimal size of a pretrained representation depends on how much unlabelled versus labelled data is available, with compression helping most when pretraining data is abundant.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the high-dimensional regime the generalization error after linear probing is an explicit function of representation dimensionality k, unlabelled sample size n_u, labelled sample size n_l, and task alignment. The error-minimising k is as small as possible (maximal compression) when n_u is large and n_l is small, and larger when n_u is small. The same expressions yield an exact trade-off: the number of additional unlabelled samples needed to compensate for the loss of one labelled sample.
What carries the argument
Closed-form high-dimensional expressions for generalization error after PCA-based pretraining followed by linear probing on the retained components.
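The two-stage pipeline described here can be sketched numerically. The following is a minimal illustration, not the paper's code: the spiked-covariance data model, the spike strengths, the noise level, and all function names (`make_data`, `pca_basis`, `linear_probe`, `probe_test_error`) are assumptions chosen for the example. It fits PCA on unlabelled samples, keeps the top k components, fits a least-squares probe on a separate labelled set, and reports test error.

```python
import numpy as np

def make_data(n, p, spikes, w_star, noise_std, rng):
    """Draw n zero-mean Gaussian samples whose covariance is the identity
    plus `spikes` on the first coordinates, with labels y = x @ w_star + noise.
    (Hypothetical data model, assumed for illustration.)"""
    scales = np.ones(p)
    scales[: len(spikes)] += spikes
    X = rng.standard_normal((n, p)) * np.sqrt(scales)
    y = X @ w_star + noise_std * rng.standard_normal(n)
    return X, y

def pca_basis(X_unlabelled, k):
    """Top-k principal directions as columns. The data are zero-mean by
    construction, so no centring is applied."""
    # Rows of Vt are right singular vectors, sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_unlabelled, full_matrices=False)
    return Vt[:k].T

def linear_probe(Z, y):
    """Least-squares probe (minimum-norm solution if underdetermined)."""
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w

def probe_test_error(k, n_u, n_l, p=200, noise_std=0.5, seed=0):
    """Test mean-squared error of a linear probe on k PCA components."""
    rng = np.random.default_rng(seed)
    spikes = np.array([20.0, 10.0, 5.0])   # assumed spike strengths
    w_star = np.zeros(p)
    w_star[:3] = 1.0                        # task aligned with the spiked directions
    X_u, _ = make_data(n_u, p, spikes, w_star, noise_std, rng)    # labels unused
    X_l, y_l = make_data(n_l, p, spikes, w_star, noise_std, rng)
    X_t, y_t = make_data(2000, p, spikes, w_star, noise_std, rng)
    V = pca_basis(X_u, k)                   # "pretraining" on unlabelled data
    w = linear_probe(X_l @ V, y_l)          # "linear probing" on labelled data
    return float(np.mean((X_t @ V @ w - y_t) ** 2))

if __name__ == "__main__":
    print(probe_test_error(k=3, n_u=2000, n_l=50))
```

Calling `probe_test_error` with different values of `k`, `n_u`, and `n_l` gives a rough empirical counterpart to the quantities the closed-form expressions describe; it does not reproduce those expressions.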
If this is right
- When unlabelled data greatly exceeds labelled data, keeping only the top few principal components after pretraining minimises downstream error.
- When unlabelled data is limited, retaining higher-dimensional representations improves generalization by preserving more task-relevant directions.
- The derived formulas give a precise numerical trade-off: the exact quantity of unlabelled samples that can replace one labelled sample while keeping error constant.
- The same non-monotonic dependence of optimal dimension on data regime is observed in trained autoencoders and in pretrained large language models.
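One natural way to formalise the unlabelled-for-labelled trade-off in the list above is as an exchange rate along a level set of the error. The sketch below is an assumption about the form of the statement, not the paper's closed-form result: it writes the generalisation error as a generic function $\mathcal{E}(k, n_u, n_l)$, holds k fixed, and treats the sample sizes as continuous variables.

```latex
% Keep the error constant while trading labelled for unlabelled samples:
\[
  \mathrm{d}\mathcal{E}
  = \frac{\partial \mathcal{E}}{\partial n_u}\,\mathrm{d}n_u
  + \frac{\partial \mathcal{E}}{\partial n_l}\,\mathrm{d}n_l
  = 0
  \quad\Longrightarrow\quad
  -\frac{\mathrm{d}n_u}{\mathrm{d}n_l}
  = \frac{\partial \mathcal{E} / \partial n_l}{\partial \mathcal{E} / \partial n_u}.
\]
```

The right-hand side is the number of unlabelled samples needed per labelled sample removed, at constant error. It is well defined only where the error actually depends on n_u, that is, where the denominator is non-zero.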
Where Pith is reading between the lines
- In practice the bottleneck dimension chosen during pretraining should be treated as a hyperparameter tuned to the expected scarcity of downstream labels.
- If the leading linear modes dominate the useful structure, the same optimal-size rule may apply approximately to nonlinear representation learners.
- The explicit trade-off supplies a quantitative target for deciding how much additional unlabelled data justifies reducing the labelled set size in a given application.
Load-bearing premise
The analysis treats structure extraction as principal component analysis on unlabelled data and downstream learning as linear regression on the resulting representation.
What would settle it
Measure test error while systematically varying the number of retained principal components, the size of the unlabelled pretraining set, and the size of the labelled probing set; check whether the error-minimising dimension moves toward smaller values exactly as predicted when labelled data becomes scarcer.
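The experiment just described can be sketched as a sweep over the number of retained components. This is a hedged illustration on synthetic data: the spiked-covariance model, the spike strengths, the noise level, and the function name `sweep_representation_size` are assumptions made for the example, and the output is not a reproduction of the paper's figures.

```python
import numpy as np

def sweep_representation_size(ks, n_u, n_l, p=100, n_test=1000, n_trials=5, seed=0):
    """Mean test MSE of a linear probe for each k in ks, averaged over
    n_trials independent draws of the unlabelled, labelled, and test sets."""
    if max(ks) > min(n_u, p):
        raise ValueError("k cannot exceed min(n_u, p)")
    rng = np.random.default_rng(seed)
    scales = np.ones(p)
    scales[:3] += [20.0, 10.0, 5.0]         # assumed spiked covariance
    w_star = np.zeros(p)
    w_star[:3] = 1.0                         # assumed task vector

    def draw(n):
        X = rng.standard_normal((n, p)) * np.sqrt(scales)
        return X, X @ w_star + 0.5 * rng.standard_normal(n)

    errors = np.zeros(len(ks))
    for _ in range(n_trials):
        X_u, _ = draw(n_u)                   # labels of the unlabelled set are discarded
        X_l, y_l = draw(n_l)
        X_t, y_t = draw(n_test)
        # PCA via SVD; rows of Vt are principal directions in decreasing order.
        _, _, Vt = np.linalg.svd(X_u, full_matrices=False)
        for i, k in enumerate(ks):
            V = Vt[:k].T
            w, *_ = np.linalg.lstsq(X_l @ V, y_l, rcond=None)
            errors[i] += np.mean((X_t @ V @ w - y_t) ** 2)
    return errors / n_trials

if __name__ == "__main__":
    ks = [1, 2, 3, 5, 10, 20, 40]
    # Three illustrative data regimes (n_u, n_l), chosen arbitrarily.
    for n_u, n_l in [(2000, 30), (60, 30), (2000, 300)]:
        errs = sweep_representation_size(ks, n_u, n_l)
        print(f"n_u={n_u:5d} n_l={n_l:4d} best k={ks[int(np.argmin(errs))]}")
```

Comparing the error-minimising k across regimes is the check the paragraph proposes. A finer grid of k and more trials would be needed before reading anything into small differences.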
Original abstract
Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error showcasing their dependence on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence downstream generalisation, and we characterize the optimal representation size as a function of task parameters: with abundant pretraining data but scarce downstream data, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Furthermore, we establish an exact trade-off between pretraining and supervision, quantifying how much unlabelled data is required to replace a single labelled sample. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, we highlight that optimising representation size is critical, giving conditions for when compression during pretraining improves generalisation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops an analytical high-dimensional model of the pretraining-plus-linear-probing pipeline. Structure extraction is formalized as PCA on unlabeled data of size n_u, and downstream learning as linear regression on a separate labeled set of size n_l. Exact closed-form expressions are derived for training and generalization error as functions of representation dimension k, n_u, n_l, and a task-alignment parameter. Minimizing the generalization error over k yields two regimes: maximal compression is optimal when pretraining data is abundant and downstream labels are scarce, while higher-dimensional representations are preferable when pretraining data is limited. The paper also gives an exact quantitative trade-off: the amount of unlabeled data needed to replace one labeled sample. The same qualitative phenomenology is reported for autoencoders and pretrained LLMs.
Significance. If the derivations hold, the paper supplies a precise, parameter-dependent characterization of when and why compression during pretraining improves downstream generalization, together with an explicit data-efficiency trade-off. The closed-form results and the consistency with observations on autoencoders and LLMs provide both theoretical insight and practical guidance for representation-size selection in modern pipelines.
major comments (1)
- §3 (high-dimensional analysis): the exact expressions for generalization error are stated to follow from the PCA-plus-linear-regression model, but the precise assumptions on the data-generating process (eigenvalue decay, alignment of the task vector with the principal components) are not restated in the main text; without them the claimed parameter-free character of the optimal-k formula cannot be verified directly from the provided derivations.
minor comments (2)
- Figure 2: the legend for the LLM curves does not indicate the number of runs or error bars; adding this information would clarify the strength of the reported agreement with the theoretical curves.
- Notation: the task-alignment parameter is introduced in Eq. (3) but subsequently referred to by several different symbols in the text; a single consistent symbol would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below.
- Referee: §3 (high-dimensional analysis): the exact expressions for generalization error are stated to follow from the PCA-plus-linear-regression model, but the precise assumptions on the data-generating process (eigenvalue decay, alignment of the task vector with the principal components) are not restated in the main text; without them the claimed parameter-free character of the optimal-k formula cannot be verified directly from the provided derivations.
  Authors: We agree that the main text would benefit from a concise restatement of the key modeling assumptions. In the revised manuscript we will add a short paragraph at the start of §3 that summarizes the eigenvalue decay law and the alignment of the task vector with the principal components. This change will allow readers to verify the derivations and the parameter-free character of the optimal-k formula without consulting the appendix, while leaving the technical results unchanged.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper constructs an explicit analytical model with PCA on unlabelled data for structure extraction and linear regression on a separate labelled dataset for downstream learning. In the high-dimensional regime it derives closed-form expressions for training and generalisation error that depend on representation dimension k, unlabelled size n_u, labelled size n_l, and task-alignment parameter. The optimal k is obtained by minimising the generalisation error with respect to these quantities, directly producing the stated regimes without reducing to any fitted parameter from the target data or to a self-citation chain. Similar phenomenology reported for autoencoders and pretrained LLMs supplies independent external support. The derivation is therefore self-contained against the model's stated assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- task alignment parameter
axioms (2)
- domain assumption: High-dimensional regime in which the number of features greatly exceeds the number of samples
- domain assumption: Structure extraction exactly equals principal component analysis on unlabeled data
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "exact expressions for training and generalisation error ... optimal representation size as a function of task parameters"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.