
arxiv: 2605.20105 · v1 · pith:ORG2DZ5Onew · submitted 2026-05-19 · 💻 cs.LG

Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

Pith reviewed 2026-05-20 06:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords: representation learning, pretraining, linear probing, high-dimensional asymptotics, generalization error, PCA, data trade-off

The pith

The optimal size of a pretrained representation depends on how much unlabelled versus labelled data is available, with compression helping most when pretraining data is abundant.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an exact high-dimensional model of the standard pretrain-then-probe pipeline. Structure is extracted by principal component analysis on unlabelled data; a downstream linear regressor is then trained on a separate labelled set. Closed-form expressions for training and test error reveal that the best representation dimension is not fixed: it shrinks to the lowest useful rank when unlabelled data is plentiful and labelled data is scarce, but grows when unlabelled data is limited. The same formulas also deliver a precise exchange rate stating how many unlabelled samples substitute for one labelled sample. The predicted dependence on representation size appears again in autoencoders and in large language models.

Core claim

In the high-dimensional regime the generalization error after linear probing is an explicit function of the representation dimensionality d (written m in the paper's figure captions), the unlabelled sample size n_u, the labelled sample size n_l, and the task alignment. The error-minimising d is as small as possible (maximal compression) when n_u is large and n_l is small, and it is larger when n_u is small. The same expressions yield an exact trade-off: the number of additional unlabelled samples needed to compensate for the loss of one labelled sample.

What carries the argument

Closed-form high-dimensional expressions for generalization error after PCA-based pretraining followed by linear probing on the retained components.
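The pipeline behind these expressions can be simulated in a few lines. The following is a minimal sketch, not the authors' code: it assumes zero-mean Gaussian inputs with the spiked covariance I_p + λvv^⊤ mentioned in the Figure 4 caption, a task vector fully aligned with the spike, uncentred PCA, and a min-norm least-squares probe. The function names and all numerical values are hypothetical choices for illustration.

```python
import numpy as np

def sample_spiked(n, v, lam, rng):
    """Draw n zero-mean Gaussian rows with covariance I_p + lam * v v^T."""
    z = rng.standard_normal((n, v.size))
    g = rng.standard_normal((n, 1))
    return z + np.sqrt(lam) * g * v  # isotropic part plus a rank-one spike

def top_m_directions(X_unlab, m):
    """Top-m principal directions (p x m) of the uncentred unlabelled data."""
    _, _, Vt = np.linalg.svd(X_unlab, full_matrices=False)
    return Vt[:m].T

def probe_test_error(U_m, X_lab, y_lab, X_test, y_test):
    """Fit (min-norm) least squares on projected labelled inputs; return test MSE."""
    w, *_ = np.linalg.lstsq(X_lab @ U_m, y_lab, rcond=None)
    return float(np.mean((X_test @ U_m @ w - y_test) ** 2))

rng = np.random.default_rng(0)
p, n_u, n_l, lam, sigma = 50, 2000, 20, 5.0, 0.5  # illustrative values only
v = np.zeros(p)
v[0] = 1.0            # spike direction
w_star = v.copy()     # assumption: task vector fully aligned with the spike

X_u = sample_spiked(n_u, v, lam, rng)    # unlabelled pretraining set
X_l = sample_spiked(n_l, v, lam, rng)    # labelled probing set
X_t = sample_spiked(2000, v, lam, rng)   # held-out test set
y_l = X_l @ w_star + sigma * rng.standard_normal(n_l)
y_t = X_t @ w_star + sigma * rng.standard_normal(2000)

for m in (1, 10, p):
    U_m = top_m_directions(X_u, m)
    print(m, probe_test_error(U_m, X_l, y_l, X_t, y_t))
```

Regressing on the m coordinates X U_m gives the same predictions as regressing on the projected inputs X P_m with P_m = U_m U_m^⊤, so the sketch works in the lower-dimensional coordinates for convenience.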

If this is right

  • When unlabelled data greatly exceeds labelled data, keeping only the top few principal components after pretraining minimises downstream error.
  • When unlabelled data is limited, retaining higher-dimensional representations improves generalization by preserving more task-relevant directions.
  • The derived formulas give a precise numerical trade-off: the exact quantity of unlabelled samples that can replace one labelled sample while keeping error constant.
  • A similar dependence of the optimal dimension on the data regime is observed in trained autoencoders and in pretrained large language models.
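The exchange rate in the third bullet can be related to the marginal rate of substitution plotted in Figure 2(d). The following is a hedged sketch, treating n_u and n_l as continuous variables and writing E for the asymptotic generalisation error; the closed form of E is in the paper and is not reproduced here.

```latex
\mathrm{MRS}(n_u, n_l)
  = \frac{\partial E / \partial n_u}{\partial E / \partial n_l},
\qquad
\left.\frac{\mathrm{d} n_u}{\mathrm{d} n_l}\right|_{E\ \text{constant}}
  = -\frac{\partial E / \partial n_l}{\partial E / \partial n_u}
  = -\frac{1}{\mathrm{MRS}(n_u, n_l)} .
```

Under this reading, removing one labelled sample is offset by roughly 1/MRS additional unlabelled samples, and MRS > 1 marks the regime in which unlabelled data reduces the error more than labelled data.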

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In practice the bottleneck dimension chosen during pretraining should be treated as a hyperparameter tuned to the expected scarcity of downstream labels.
  • If the leading linear modes dominate the useful structure, the same optimal-size rule may apply approximately to nonlinear representation learners.
  • The explicit trade-off supplies a quantitative target for deciding how much additional unlabelled data justifies reducing the labelled set size in a given application.
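Treating the retained dimension as a hyperparameter, as the first bullet suggests, could look like the sketch below. It is an illustration only, not a procedure from the paper: it assumes the pretrained directions are available as columns of a matrix ordered by importance (for instance PCA directions) and picks the dimension by a single hold-out split of the labelled set. The function name and its parameters are hypothetical.

```python
import numpy as np

def select_m_by_holdout(U, X_lab, y_lab, candidates, val_frac=0.3, seed=0):
    """Pick the number of retained directions with the lowest hold-out MSE.

    U: (p, k) matrix whose columns are pretrained directions, most important first.
    Returns the selected m and a dict mapping each candidate to its hold-out MSE.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_lab))
    n_val = max(1, int(val_frac * len(y_lab)))
    val, tr = idx[:n_val], idx[n_val:]
    scores = {}
    for m in candidates:
        Z_tr = X_lab[tr] @ U[:, :m]     # training inputs in the first m directions
        Z_val = X_lab[val] @ U[:, :m]   # validation inputs in the same directions
        w, *_ = np.linalg.lstsq(Z_tr, y_lab[tr], rcond=None)
        scores[m] = float(np.mean((Z_val @ w - y_lab[val]) ** 2))
    return min(scores, key=scores.get), scores
```

With very few labels a single split is noisy; averaging over several splits is a natural refinement, at the cost of more probe fits.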

Load-bearing premise

The analysis treats structure extraction as principal component analysis on unlabelled data and downstream learning as linear regression on the resulting representation.
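Written out, the premise amounts to something like the following. This is a sketch under stated assumptions: the projector P_m = U_m U_m^⊤ is taken from the Figure 1 caption, while the uncentred sample covariance, the squared-loss probe, and the definition of the generalisation error are standard choices assumed here rather than quoted from the paper.

```latex
% Pretraining: PCA on the unlabelled sample X_u \in \mathbb{R}^{n_u \times p}
\hat{\Sigma}_u = \tfrac{1}{n_u} X_u^\top X_u, \qquad
U_m = [\,u_1, \dots, u_m\,] \ \text{(top-$m$ eigenvectors of } \hat{\Sigma}_u\text{)}, \qquad
P_m = U_m U_m^\top .

% Probing: linear regression on the projected labelled inputs
\hat{w} \in \arg\min_{w} \ \sum_{i=1}^{n_l} \bigl( y_i - w^\top P_m x_i \bigr)^2 ,
\qquad
E_{\mathrm{gen}} = \mathbb{E}_{(x, y)} \bigl[ ( y - \hat{w}^\top P_m x )^2 \bigr] .
```

The figure captions parameterise the representation size by α, with α = 1 meaning all principal components are kept; reading α as the retained fraction m/p is an assumption.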

What would settle it

Measure test error while systematically varying the number of retained principal components, the size of the unlabelled pretraining set, and the size of the labelled probing set; check whether the error-minimising dimension moves toward smaller values exactly as predicted when labelled data becomes scarcer.
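A small simulation along these lines might look as follows. It is a sketch under assumptions, not the paper's experimental setup: spiked Gaussian inputs with covariance I_p + λvv^⊤, a task vector aligned with the spike, uncentred PCA, and a min-norm least-squares probe. The grid, sample sizes, and the function name `best_dimension` are hypothetical.

```python
import numpy as np

def sample_spiked(n, v, lam, rng):
    """Draw n zero-mean Gaussian rows with covariance I_p + lam * v v^T."""
    z = rng.standard_normal((n, v.size))
    return z + np.sqrt(lam) * rng.standard_normal((n, 1)) * v

def best_dimension(n_u, n_l, p=40, lam=5.0, sigma=0.5,
                   grid=(1, 5, 10, 20, 40), trials=10, n_test=1000, seed=0):
    """Return the grid value of m with the lowest average test MSE, plus all averages."""
    rng = np.random.default_rng(seed)
    v = np.zeros(p)
    v[0] = 1.0
    w_star = v.copy()  # assumption: task vector aligned with the spike
    errs = np.zeros(len(grid))
    for _ in range(trials):
        X_u = sample_spiked(n_u, v, lam, rng)
        X_l = sample_spiked(n_l, v, lam, rng)
        X_t = sample_spiked(n_test, v, lam, rng)
        y_l = X_l @ w_star + sigma * rng.standard_normal(n_l)
        y_t = X_t @ w_star + sigma * rng.standard_normal(n_test)
        # One SVD per trial; slicing rows of Vt gives the top-m directions.
        # If n_u < m, fewer than m directions exist and the slice is shorter.
        _, _, Vt = np.linalg.svd(X_u, full_matrices=False)
        for j, m in enumerate(grid):
            U_m = Vt[:m].T
            w, *_ = np.linalg.lstsq(X_l @ U_m, y_l, rcond=None)
            errs[j] += np.mean((X_t @ U_m @ w - y_t) ** 2)
    errs /= trials
    return grid[int(np.argmin(errs))], dict(zip(grid, errs.tolist()))

for n_u, n_l in [(2000, 10), (2000, 400), (20, 10), (20, 400)]:
    m_best, _ = best_dimension(n_u, n_l)
    print(f"n_u={n_u:5d} n_l={n_l:4d} -> best m on grid: {m_best}")
```

Comparing the printed minimiser across the four (n_u, n_l) settings gives a coarse version of the check described above; a finer grid and more trials would be needed before drawing conclusions about the predicted phase boundaries.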

Figures

Figures reproduced from arXiv: 2605.20105 by Andrew Saxe, Clémentine Dominé, Marco Mondelli, Rachel Swanson, Valentina Njaradi.

Figure 1: Analytically tractable model of two-stage learning. (a) In the pretraining stage, PCA extracts the top-m principal components from n_u unlabelled samples. The downstream task with n_l labelled samples is learned via regression on inputs projected through P_m = U_m U_m^⊤. Theoretically derived generalisation (b) and training errors (c) match numerical simulations.
Figure 2: Benefits of optimal representation size. (a) Minimum achievable generalisation error with optimal α*. (b) Gain in generalisation error from using α* instead of all PCs (α = 1). (c) Same gain relative to α ≈ 0. (d) Marginal rate of substitution between unlabelled and labelled data, (∂E_gen^∞/∂n_u) / (∂E_gen^∞/∂n_l); values above 1 (red) indicate that unlabelled data reduces the error more than labelled dat…
Figure 3: When is compression useful. (a) Heatmap of the generalisation-optimal α* reveals distinct phases; white curves mark the theoretical phase-transition boundaries where low α is optimal. Phase boundaries in the (n_u, n_l) plane for varying SNR (b), varying λ (c) and varying η (d). Non-varying parameters are λ = 5, SNR = 9 and η = 1. Simulation details are given in Appendix D.
Figure 4: Optimal representation in autoencoders and pretrained LLMs. Optimal representation size as a function of unlabelled (n_u) and labelled (n_l) sample sizes. Linear autoencoders are considered in panels (a,c) and nonlinear ones in panels (b,d). Autoencoders are trained on Gaussian data with either spiked identity covariance Σ = I_p + λvv^⊤ (panels (a,b)) or a spiked Toeplitz covariance Σ = H + λvv^⊤, where H_{i,j} = …
Figure 5: Generalisation error of PCR as a function of the number of retained components
Figure 6: Components of the theoretical estimation error as a function of …
Figure 7: Components of the theoretical generalisation error as a function of …
Figure 8: Optimal value of α that minimizes the generalisation error, shown as a function of SNR and spike alignment η. We fix λ = 2. Red lines indicate the approximate phase transitions.
Figure 9: Optimal value of α that minimizes the training error, shown as a function of SNR and spike alignment η. We fix λ = 2.
Figure 10: Optimal value of α that minimizes the generalisation error, shown as a function of spike strength λ and spike alignment θ with w*. We fix SNR = 9. Red lines indicate the approximate phase transitions.
Figure 11: Optimal value of α that minimizes the training error, shown as a function of spike strength λ and spike alignment θ with w*. We fix SNR = 9.
[Plot residue, likely from the following figure: panels "Task-aligned" and "Task-misaligned"; axes: unlabelled n_u and labelled n_l, each from 1 to 3000; legend: PR, PCR, Regression; best model within a 2% tie threshold.]
Figure 12: Heatmaps over the (n_u, n_l) grid (p = 500, SNR = 1.8, λ = 5) showing which method, standard regression (violet), pretrained regression (PR, blue) or principal component regression (PCR, orange), achieves lower generalisation error. Forward hatching (//) marks regions where PR and PCR are tied (within 2% error); cross hatching (×) marks regions where all three methods (PR, PCR, and standard regression) ar…
Figure 13: Extended version of panels (a-d) in …
Figure 14: Effect of SNR on optimal bottleneck size in autoencoders on inputs with different covariance structure. We …
Figure 15: Spike direction recovery as a function of sample ratio
Figure 16: Optimal m using PCA on representations of the downstream task. We use the same layout as …
Figure 17: Validation accuracy vs number of PCA components
Figure 18: Eigenvalue spectra of Pythia-70M-deduped last-token last-hidden-layer representations at five pretraining checkpoints. Each panel shows the empirical distribution of covariance eigenvalues computed from the representation of the sst2 task. Throughout pretraining, the spectrum shows a clear bulk and a few outlying eigenvalues. As pretraining progresses (steps 128 to 143,000), the outlying eigenvalues becom…
Figure 19: Validation accuracy vs pretraining checkpoint for three probing conditions across five downstream tasks
Original abstract

Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error showcasing their dependence on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence downstream generalisation, and we characterize the optimal representation size as a function of task parameters: with abundant pretraining data but scarce downstream data, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Furthermore, we establish an exact trade-off between pretraining and supervision, quantifying how much unlabelled data is required to replace a single labelled sample. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, we highlight that optimising representation size is critical, giving conditions for when compression during pretraining improves generalisation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript develops an analytical high-dimensional model of the pretraining-plus-linear-probing pipeline. Structure extraction is formalized as PCA on unlabeled data of size n_u, and downstream learning as linear regression on a separate labeled set of size n_l. Exact closed-form expressions are derived for training and generalization error as functions of the representation dimension k (m in the paper's figure captions), n_u, n_l, and a task-alignment parameter. Minimizing the generalization error over k yields two regimes: maximal compression is optimal when pretraining data is abundant and downstream labels are scarce, while higher-dimensional representations are preferable when pretraining data is limited. The paper also gives an exact quantitative trade-off, namely the amount of unlabeled data needed to replace one labeled sample. The same qualitative phenomenology is reported for autoencoders and pretrained LLMs.

Significance. If the derivations hold, the paper supplies a precise, parameter-dependent characterization of when and why compression during pretraining improves downstream generalization, together with an explicit data-efficiency trade-off. The closed-form results and the consistency with observations on autoencoders and LLMs provide both theoretical insight and practical guidance for representation-size selection in modern pipelines.

major comments (1)
  1. §3 (high-dimensional analysis): the exact expressions for generalization error are stated to follow from the PCA-plus-linear-regression model, but the precise assumptions on the data-generating process (eigenvalue decay, alignment of the task vector with the principal components) are not restated in the main text; without them the claimed parameter-free character of the optimal-k formula cannot be verified directly from the provided derivations.
minor comments (2)
  1. Figure 4: the legend for the LLM panels does not indicate the number of runs or error bars; adding this information would clarify the strength of the reported agreement with the theoretical predictions.
  2. Notation: the task-alignment parameter is introduced in Eq. (3) but subsequently referred to by several different symbols in the text; a single consistent symbol would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: §3 (high-dimensional analysis): the exact expressions for generalization error are stated to follow from the PCA-plus-linear-regression model, but the precise assumptions on the data-generating process (eigenvalue decay, alignment of the task vector with the principal components) are not restated in the main text; without them the claimed parameter-free character of the optimal-k formula cannot be verified directly from the provided derivations.

    Authors: We agree that the main text would benefit from a concise restatement of the key modeling assumptions. In the revised manuscript we will add a short paragraph at the start of §3 that summarizes the eigenvalue decay law and the alignment of the task vector with the principal components. This change will allow readers to verify the derivations and the parameter-free character of the optimal-k formula without consulting the appendix, while leaving the technical results unchanged. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper constructs an explicit analytical model with PCA on unlabelled data for structure extraction and linear regression on a separate labelled dataset for downstream learning. In the high-dimensional regime it derives closed-form expressions for training and generalisation error that depend on the representation dimension k, the unlabelled size n_u, the labelled size n_l, and a task-alignment parameter. The optimal k is obtained by minimising the generalisation error over k for given values of the other quantities, which directly produces the stated regimes without reducing to any fitted parameter from the target data or to a self-citation chain. Similar phenomenology reported for autoencoders and pretrained LLMs supplies independent external support. The derivation is therefore self-contained against the model's stated assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on modeling pretraining exactly as PCA and probing as linear regression, plus the high-dimensional asymptotic regime that enables closed-form solutions.

free parameters (1)
  • task alignment parameter
    The error expressions depend on task alignment with the principal components; this quantity is treated as an input parameter of the model.
axioms (2)
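  • domain assumption High-dimensional regime in which the number of features p and the sample sizes n_u and n_l are all large (the reported simulations use p = 500 with n_u and n_l between 1 and 3000, so features need not outnumber samples)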
    Invoked to obtain exact closed-form expressions for training and generalization error.
  • domain assumption Structure extraction exactly equals principal component analysis on unlabeled data
    Stated explicitly as the formalization of pretraining.

pith-pipeline@v0.9.0 · 5780 in / 1478 out tokens · 50648 ms · 2026-05-20T06:49:05.537835+00:00 · methodology



Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 6 internal anchors

  1. [1]

    GPT-4 Technical Report

    OpenAI et al. GPT-4 Technical Report.arXiv preprint arXiv:2303.08774, 2024

  2. [2]

    The Llama 3 Herd of Models

    Aaron Grattafiori et al. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783, 2024

  3. [3]

    GLUE: A multi-task benchmark and analysis platform for natural language understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018

  4. [4]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021

  5. [5]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InAdvances in Neural Information Processing Systems, volume 30, 2017

  6. [6]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  7. [7]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Represen- tations, 2022

  8. [8]

    Test-time training with self-supervision for generalization under distribution shifts

    Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. InInternational conference on machine learning. PMLR, 2020

  9. [9]

    Ttt++: When does self-supervised test-time training fail or thrive? InAdvances in Neural Information Processing Systems, volume 34, 2021

    Yuejiang Liu, Parth Kothari, Bastien Van Delft, Baptiste Bellot-Gurlet, Taylor Mordan, and Alexandre Alahi. Ttt++: When does self-supervised test-time training fail or thrive? InAdvances in Neural Information Processing Systems, volume 34, 2021

  10. [10]

    The surprising effectiveness of test-time training for few-shot learning

    Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. The surprising effectiveness of test-time training for few-shot learning. InInternational Conference on Machine Learning. PMLR, 2025

  11. [11]

    Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning

    Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021

  12. [12]

    Universal language model fine-tuning for text classification

    Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018

  13. [13]

    Greedy layer-wise training of deep networks

    Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. InAdvances in Neural Information Processing Systems, volume 19, 2006

  14. [14]

    Why does unsupervised pre-training help deep learning?Journal of Machine Learning Research, 11, 2010

    Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning?Journal of Machine Learning Research, 11, 2010

  15. [15]

    Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion

    Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 2010

  16. [16]

    Representation learning: A review and new perspectives

    Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 2013

  17. [17]

    The representation of object concepts in the brain.Annu

    Alex Martin. The representation of object concepts in the brain.Annu. Rev. Psychol., 58, 2007

  18. [18]

    Why neurons mix: high dimensionality for higher cognition

    Stefano Fusi, Earl K Miller, and Mattia Rigotti. Why neurons mix: high dimensionality for higher cognition. Current Opinion in Neurobiology, 37, 2016. 10 Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear ProbingA PREPRINT

  19. [19]

    Continual lifelong learning with neural networks: A review.Neural Networks, 113, 2019

    German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review.Neural Networks, 113, 2019

  20. [20]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114, 2017

  21. [21]

    Continual learning through synaptic intelligence

    Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning. PMLR, 2017

  22. [22]

    Exact learning dynamics of deep linear networks with prior knowledge

    Lukas Braun, Clémentine Carla Juliette Dominé, James E Fitzgerald, and Andrew M Saxe. Exact learning dynamics of deep linear networks with prior knowledge. InAdvances in Neural Information Processing Systems, 2022

  23. [23]

    The dimensionality of neural represen- tations for control.Current Opinion in Behavioral Sciences, 38, 2021

    David Badre, Apoorva Bhandari, Haley Keglovits, and Atsushi Kikumoto. The dimensionality of neural represen- tations for control.Current Opinion in Behavioral Sciences, 38, 2021

  24. [24]

    Boyle, Lorenzo Posani, Sarah Irfan, Steven A

    Lara M. Boyle, Lorenzo Posani, Sarah Irfan, Steven A. Siegelbaum, and Stefano Fusi. Tuned geometries of hippocampal representations meet the computational demands of social memory.Neuron, 112, 2024

  25. [25]

    Courellis, Juri Minxha, Araceli R

    Hristos S. Courellis, Juri Minxha, Araceli R. Cardenas, Daniel L. Kimmel, Chrystal M. Reed, Taufik A. Valiante, C. Daniel Salzman, Adam N. Mamelak, Stefano Fusi, and Ueli Rutishauser. Abstract representations emerge in human hippocampal neurons during inference.Nature, 632, 2024

  26. [26]

    Karyna Mishchanchuk, Gabrielle Gregoriou, Albert Qü, Alizée Kastler, Quentin J. M. Huys, Linda Wilbrecht, and Andrew F. MacAskill. Hidden state inference requires abstract contextual representations in the ventral hippocampus.Science, 386, 2024

  27. [27]

    Rodgers, Randy M

    Ramon Nogueira, Chris C. Rodgers, Randy M. Bruno, and Stefano Fusi. The geometry of cortical representations of touch in rodents.Nature Neuroscience, 26, 2023

  28. [28]

    The geometry of hidden representations of large transformer models

    Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. The geometry of hidden representations of large transformer models. InAdvances in Neural Information Processing Systems, volume 36, 2023

  29. [29]

    High-dimensional asymptotics of prediction: Ridge regression and classifica- tion.The Annals of Statistics, 2018

    Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classifica- tion.The Annals of Statistics, 2018

  30. [30]

    Tibshirani

    Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation.The Annals of Statistics, 50, 2022

  31. [31]

    Asymptotics of ridge(less) regression under general source condition

    Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge(less) regression under general source condition. InInternational Conference on Artificial Intelligence and Statistics, 2020

  32. [32]

    On the optimal weighted \ell_2 regularization in overparameterized linear regression

    Denny Wu and Ji Xu. On the optimal weighted \ell_2 regularization in overparameterized linear regression. In Advances in Neural Information Processing Systems, volume 33, 2020

  33. [33]

    Bartlett, Philip M

    Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117, 2020

  34. [34]

    Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116, 2019

    Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116, 2019

  35. [35]

    Advani, Andrew M

    Madhu S. Advani, Andrew M. Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks.Neural Networks, 132, 2020

  36. [36]

    The distribution of ridgeless least squares interpolators.Journal of Machine Learning Research, 27, 2026

    Qiyang Han and Xiaocong Xu. The distribution of ridgeless least squares interpolators.Journal of Machine Learning Research, 27, 2026

  37. [37]

    Spurious correlations in high dimensional regression: The roles of regularization, simplicity bias and over-parameterization

    Simone Bombari and Marco Mondelli. Spurious correlations in high dimensional regression: The roles of regularization, simplicity bias and over-parameterization. InInternational Conference on Machine Learning, 2025

  38. [38]

    Optimal regularization for performative learning.arXiv preprint arXiv:2510.12249, 2025

    Edwige Cyffers, Alireza Mirrokni, and Marco Mondelli. Optimal regularization for performative learning.arXiv preprint arXiv:2510.12249, 2025. 11 Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear ProbingA PREPRINT

  39. [39]

    High-dimensional analysis of synthetic data selection

    Parham Rezaei, Filip Kovacevic, Francesco Locatello, and Marco Mondelli. High-dimensional analysis of synthetic data selection. InInternational Conference on Learning Representations, 2026

  40. [40]

    Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, and Samet Oymak

    M. Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, and Samet Oymak. High-dimensional analysis of knowledge distillation: Weak-to-strong generalization and scaling laws. InInternational Conference on Learning Representations, 2025

  41. [41]

    Towards a statistical theory of data selection under weak supervision

    Germain Kolossov, Andrea Montanari, and Pulkit Tandon. Towards a statistical theory of data selection under weak supervision. InInternational Conference on Learning Representations, 2024

  42. [42]

    Scaling laws for learning with real and surrogate data.Advances in Neural Information Processing Systems, 37, 2024

    Ayush Jain, Andrea Montanari, and Eren Sasoglu. Scaling laws for learning with real and surrogate data.Advances in Neural Information Processing Systems, 37, 2024

  43. [43]

    Jolliffe

    Ian T. Jolliffe. A Note on the Use of Principal Components in Regression.Applied Statistics, 31, 1982

  44. [44]

    On the number of variables to use in principal component regression

    Ji Xu and Daniel J Hsu. On the number of variables to use in principal component regression. InAdvances in Neural Information Processing Systems, volume 32, 2019

  45. [45]

    The high-dimensional asymptotics of principal component regression.The Annals of Statistics, 53, 2025

    Alden Green and Elad Romanov. The high-dimensional asymptotics of principal component regression.The Annals of Statistics, 53, 2025

  46. [46]

    William F. Massy. Principal components regression in exploratory statistical research.Journal of the American Statistical Association, 60, 1965

  47. [47]

    Dhillon, Dean P

    Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. A risk comparison of ordinary least squares vs ridge regression.Journal of Machine Learning Research, 14, 2013

  48. [48]

    Cambridge series in statistical and probabilistic mathematics

    Jianfeng Yao, Zhidong Bai, and Shui-Rong Zheng.Large sample covariance matrices and high-dimensional data analysis. Cambridge series in statistical and probabilistic mathematics. Cambridge university press, 2015

  49. [49]

    A survey on transfer learning

    Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22, 2010

  50. [50]

    A comprehensive survey on transfer learning

    Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, PP, 2020

  51. [51]

    Probing transfer learning with a model of synthetic correlated datasets

    Federica Gerace, Luca Saglietti, Stefano Sarao Mannelli, Andrew Saxe, and Lenka Zdeborová. Probing transfer learning with a model of synthetic correlated datasets. Machine Learning: Science and Technology, 2022

  52. [52]

    An analytic theory of generalization dynamics and transfer learning in deep linear networks

    Andrew K. Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. In International Conference on Learning Representations, 2019

  53. [53]

    Features are fate: a theory of transfer learning in high-dimensional regression

    Javan Tahir, Surya Ganguli, and Grant M. Rotskoff. Features are fate: a theory of transfer learning in high-dimensional regression. In Forty-second International Conference on Machine Learning, 2025

  54. [54]

    Precise high-dimensional asymptotics for quantifying heterogeneous transfers

    Fan Yang, Hongyang R Zhang, Sen Wu, Christopher Re, and Weijie J Su. Precise high-dimensional asymptotics for quantifying heterogeneous transfers. Journal of Machine Learning Research, 26, 2025

  55. [55]

    Generalization error of min-norm interpolators in transfer learning

    Yanke Song, Sohom Bhattacharya, and Pragya Sur. Generalization error of min-norm interpolators in transfer learning. arXiv preprint arXiv:2406.13944, 2024

  56. [56]

    Predicting what you already know helps: Provable self-supervised learning

    Jason D. Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021

  57. [57]

    Semi-supervised sparse gaussian classification: Provable benefits of unlabeled data

    Eyar Azar and Boaz Nadler. Semi-supervised sparse gaussian classification: Provable benefits of unlabeled data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  58. [58]

    A Theoretical Analysis of Contrastive Unsupervised Representation Learning

    Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019

  59. [59]

    On the inductive bias of masked language modeling: From statistical to syntactic dependencies

    Tianyi Zhang and Tatsunori B. Hashimoto. On the inductive bias of masked language modeling: From statistical to syntactic dependencies. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

  60. [60]

    Provable benefits of unsupervised pre-training and transfer learning via single-index models

    Taj Jones-Mccormick, Aukosh Jagannath, and Subhabrata Sen. Provable benefits of unsupervised pre-training and transfer learning via single-index models. In Proceedings of the 42nd International Conference on Machine Learning, volume 267. PMLR, 2025

  61. [61]

    Understanding intermediate layers using linear classifier probes

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2018

  62. [62]

    A mathematical theory for understanding when abstract representations emerge in neural networks

    Bin Wang, W. Jeffrey Johnston, and Stefano Fusi. A mathematical theory for understanding when abstract representations emerge in neural networks. arXiv preprint arXiv:2510.09816, 2026

  63. [63]

    A theory of how pretraining shapes inductive bias in fine-tuning

    Nicolas Anguita, Francesco Locatello, Andrew M. Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, and Clementine Domine. A theory of how pretraining shapes inductive bias in fine-tuning. arXiv preprint arXiv:2602.20062, 2026

  64. [64]

    On the distribution of the largest eigenvalue in principal components analysis

    Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2), 2001

  65. [65]

    Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices

    Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33, 2005

  66. [66]

    V. A. Marčenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1, 1967

  67. [67]

    Spectral Analysis of Large Dimensional Random Matrices

    Zhidong Bai and Jack W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer Series in Statistics. Springer New York, 2010

  68. [68]

    Eigenvalues of large sample covariance matrices of spiked population models

    Jinho Baik and Jack W Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97, 2006

  69. [69]

    Asymptotics of sample eigenstructure for a large dimensional spiked covariance model

    Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 2007

  70. [70]

    On sample eigenvalues in a generalized spiked population model

    Zhidong Bai and Jianfeng Yao. On sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis, 106, 2012

  71. [71]

    Learning in the presence of low-dimensional structure: a spiked random matrix perspective

    Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, and Denny Wu. Learning in the presence of low-dimensional structure: a spiked random matrix perspective. In Advances in Neural Information Processing Systems, volume 36, 2023

  72. [72]

    When do neural networks outperform kernel methods?

    Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? In Advances in Neural Information Processing Systems, volume 33, 2020

  73. [73]

    Gradient-based feature learning under structured data

    Alireza Mousavi-Hosseini, Denny Wu, Taiji Suzuki, and Murat A Erdogdu. Gradient-based feature learning under structured data. In Advances in Neural Information Processing Systems, volume 36, 2023

  74. [74]

    High-dimensional asymptotics of feature learning: How one gradient step improves the representation

    Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Advances in Neural Information Processing Systems, volume 35, 2022

  75. [75]

    Asymptotics of feature learning in two-layer networks after one gradient-step

    Hugo Cui, Luca Pesce, Yatin Dandi, Florent Krzakala, Yue M Lu, Lenka Zdeborová, and Bruno Loureiro. Asymptotics of feature learning in two-layer networks after one gradient-step. In International Conference on Machine Learning, 2024

  76. [76]

    A random matrix theory perspective on the spectrum of learned features and asymptotic generalization capabilities

    Yatin Dandi, Luca Pesce, Hugo Cui, Florent Krzakala, Yue Lu, and Bruno Loureiro. A random matrix theory perspective on the spectrum of learned features and asymptotic generalization capabilities. In International Conference on Artificial Intelligence and Statistics, pages 2224–2232. PMLR, 2025

  77. [77]

    Low rank gradients and where to find them

    Rishi Sonthalia, Michael Murray, and Guido F. Montúfar. Low rank gradients and where to find them. In Advances in Neural Information Processing Systems, 2025

  78. [78]

    No double descent in principal component regression: A high-dimensional analysis

    Daniel Gedon, Antônio H. Ribeiro, and Thomas B. Schön. No double descent in principal component regression: A high-dimensional analysis. In International Conference on Machine Learning, 2024

  79. [79]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023

  80. [80]

    Datasheet for the pile

    Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the pile. arXiv preprint arXiv:2201.07311, 2022
