Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

Andrew Saxe; Clémentine Dominé; Marco Mondelli; Rachel Swanson; Valentina Njaradi

arxiv: 2605.20105 · v1 · pith:ORG2DZ5Onew · submitted 2026-05-19 · 💻 cs.LG

Pith reviewed 2026-05-20 06:49 UTC · model grok-4.3

classification 💻 cs.LG

keywords representation learning · pretraining · linear probing · high-dimensional asymptotics · generalization error · PCA · data trade-off

The pith

The optimal size of a pretrained representation depends on how much unlabelled versus labelled data is available, with compression helping most when pretraining data is abundant.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an exact high-dimensional model of the standard pretrain-then-probe pipeline. Structure is extracted by principal component analysis on unlabelled data; a downstream linear regressor is then trained on a separate labelled set. Closed-form expressions for training and test error reveal that the best representation dimension is not fixed: it shrinks to the lowest useful rank when unlabelled data is plentiful and labelled data is scarce, but grows when unlabelled data is limited. The same formulas also deliver a precise exchange rate stating how many unlabelled samples substitute for one labelled sample. The predicted dependence on representation size appears again in autoencoders and in large language models.
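The two-stage pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: it assumes Gaussian inputs with a spiked identity covariance (the form quoted in the Figure 4 caption), a task vector fully aligned with the spike, and illustrative values for p, n_u, n_l, m, λ and the label noise. All function names are hypothetical.

```python
import numpy as np

def sample_spiked(n, p, lam, v, rng):
    """Draw n zero-mean Gaussian rows with covariance I_p + lam * v v^T (v unit norm)."""
    noise = rng.standard_normal((n, p))
    spike = rng.standard_normal((n, 1))
    return noise + np.sqrt(lam) * spike * v

def pca_basis(x_unlabelled, m):
    """Return the top-m eigenvectors (as columns) of the unlabelled sample covariance."""
    cov = x_unlabelled.T @ x_unlabelled / x_unlabelled.shape[0]
    _, eigvecs = np.linalg.eigh(cov)   # eigenvalues are returned in ascending order
    return eigvecs[:, ::-1][:, :m]     # reorder so the leading component comes first

def linear_probe(x_labelled, y, basis):
    """Least-squares regression on the projected inputs, mapped back to p dimensions."""
    coef, *_ = np.linalg.lstsq(x_labelled @ basis, y, rcond=None)
    return basis @ coef

def excess_risk(w_hat, w_star, lam, v):
    """Exact excess test error (w_hat - w_star)^T Sigma (w_hat - w_star) for Sigma = I_p + lam v v^T."""
    diff = w_hat - w_star
    return float(diff @ diff + lam * (v @ diff) ** 2)

# Illustrative (assumed) settings, not values taken from the paper.
rng = np.random.default_rng(0)
p, n_u, n_l, m, lam, noise_std = 50, 2000, 30, 3, 10.0, 0.5
v = np.zeros(p)
v[0] = 1.0                  # hypothetical spike direction
w_star = v.copy()           # hypothetical task vector, fully aligned with the spike

x_u = sample_spiked(n_u, p, lam, v, rng)   # stage 1: unlabelled pretraining set
basis = pca_basis(x_u, m)

x_l = sample_spiked(n_l, p, lam, v, rng)   # stage 2: separate labelled probing set
y_l = x_l @ w_star + noise_std * rng.standard_normal(n_l)
w_hat = linear_probe(x_l, y_l, basis)

print("excess test error:", excess_risk(w_hat, w_star, lam, v))
```

Regressing on the m coordinates `x @ basis` gives the same predictions as minimum-norm regression on the inputs projected through the rank-m projector built from the same basis, which is why the sketch works in the reduced coordinates.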

Core claim

In the high-dimensional regime the generalization error after linear probing is an explicit function of representation dimensionality d, unlabelled sample size n_u, labelled sample size n_l, and task alignment. The error-minimising d is as small as possible (maximal compression) when n_u is large and n_l is small, and it is larger when n_u is small. The same expressions yield an exact trade-off: the number of additional unlabelled samples needed to compensate for the loss of one labelled sample.

What carries the argument

Closed-form high-dimensional expressions for generalization error after PCA-based pretraining followed by linear probing on the retained components.

If this is right

When unlabelled data greatly exceeds labelled data, keeping only the top few principal components after pretraining minimises downstream error.
When unlabelled data is limited, retaining higher-dimensional representations improves generalization by preserving more task-relevant directions.
The derived formulas give a precise numerical trade-off: the exact quantity of unlabelled samples that can replace one labelled sample while keeping error constant.
A similar dependence of the optimal dimension on the data regime is observed in trained autoencoders and in pretrained large language models.
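The trade-off in the third point follows from a one-line calculation. Write E(n_u, n_l) for the asymptotic generalisation error at a fixed representation size (a shorthand used here, not necessarily the paper's notation) and treat the sample sizes as continuous. Holding E constant then gives:

```latex
\mathrm{d}E
  = \frac{\partial E}{\partial n_u}\,\mathrm{d}n_u
  + \frac{\partial E}{\partial n_l}\,\mathrm{d}n_l
  = 0
\quad\Longrightarrow\quad
\left.\frac{\mathrm{d}n_u}{\mathrm{d}n_l}\right|_{E}
  = -\,\frac{\partial E/\partial n_l}{\partial E/\partial n_u}.
```

The magnitude of this ratio is the number of unlabelled samples that offsets the loss of one labelled sample. It is the reciprocal of the marginal rate of substitution quoted in the Figure 2(d) caption, (∂E/∂n_u) / (∂E/∂n_l), so values above 1 there correspond to needing fewer than one unlabelled sample per labelled sample.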

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

In practice the bottleneck dimension chosen during pretraining should be treated as a hyperparameter tuned to the expected scarcity of downstream labels.
If the leading linear modes dominate the useful structure, the same optimal-size rule may apply approximately to nonlinear representation learners.
The explicit trade-off supplies a quantitative target for deciding how much additional unlabelled data justifies reducing the labelled set size in a given application.

Load-bearing premise

The analysis treats structure extraction as principal component analysis on unlabelled data and downstream learning as linear regression on the resulting representation.

What would settle it

Measure test error while systematically varying the number of retained principal components, the size of the unlabelled pretraining set, and the size of the labelled probing set; check whether the error-minimising dimension moves toward smaller values exactly as predicted when labelled data becomes scarcer.
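A scaled-down version of this check can be prototyped on synthetic data before running it on real representations. The sketch below is not the paper's experiment: it assumes a spiked identity covariance, a task vector fully aligned with the spike, and illustrative values for p, n_u, n_l, λ and the noise level. The helper names are hypothetical.

```python
import numpy as np

def sample_spiked(n, p, lam, v, rng):
    """Draw n zero-mean Gaussian rows with covariance I_p + lam * v v^T (v unit norm)."""
    return rng.standard_normal((n, p)) + np.sqrt(lam) * rng.standard_normal((n, 1)) * v

def sweep_components(p, n_u, n_l, lam, noise_std, trials, seed):
    """Average excess test error for every m in 1..p; returns (best_m, errors)."""
    rng = np.random.default_rng(seed)
    v = np.zeros(p)
    v[0] = 1.0               # assumed spike direction
    w_star = v.copy()        # assumed task vector, aligned with the spike
    errors = np.zeros(p)
    for _ in range(trials):
        x_u = sample_spiked(n_u, p, lam, v, rng)
        x_l = sample_spiked(n_l, p, lam, v, rng)
        y_l = x_l @ w_star + noise_std * rng.standard_normal(n_l)
        # PCA on the unlabelled set, computed once per trial.
        _, eigvecs = np.linalg.eigh(x_u.T @ x_u / n_u)
        ordered = eigvecs[:, ::-1]                  # leading component first
        for m in range(1, p + 1):
            basis = ordered[:, :m]
            coef, *_ = np.linalg.lstsq(x_l @ basis, y_l, rcond=None)
            diff = basis @ coef - w_star
            # Exact excess risk under Sigma = I_p + lam v v^T.
            errors[m - 1] += diff @ diff + lam * (v @ diff) ** 2
    errors /= trials
    return int(np.argmin(errors)) + 1, errors

# Abundant unlabelled data, scarce labels (illustrative values only).
best_m, errors = sweep_components(p=40, n_u=5000, n_l=20, lam=10.0,
                                  noise_std=1.0, trials=5, seed=0)
print("error-minimising m:", best_m)
```

Re-running `sweep_components` over a grid of `n_u` and `n_l` values and recording `best_m` at each grid point gives an empirical counterpart to the predicted shift of the optimal dimension.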

Figures

Figures reproduced from arXiv: 2605.20105 by Andrew Saxe, Clémentine Dominé, Marco Mondelli, Rachel Swanson, Valentina Njaradi.

**Figure 1.** Analytically tractable model of two-stage learning. (a) In the pretraining stage, PCA extracts the top-m principal components from n_u unlabelled samples. The downstream task with n_l labelled samples is learned via regression on inputs projected through P_m = U_m U_m^⊤. Theoretically derived generalisation (b) and training errors (c) match numerical simulations.

**Figure 2.** Benefits of optimal representation size. (a) Minimum achievable generalisation error with optimal α⋆. (b) Gain in generalisation error from using α⋆ instead of all PCs (α = 1). (c) Same gain relative to α ≈ 0. (d) Marginal rate of substitution between unlabelled and labelled data, (∂E_∞^gen/∂n_u) / (∂E_∞^gen/∂n_l); values above 1 (red) indicate that unlabelled data reduces the error more than labelled dat…

**Figure 3.** When is compression useful. (a) Heatmap of the generalisation-optimal α⋆ reveals distinct phases; white curves mark the theoretical phase-transition boundaries where low α is optimal. Phase boundaries in the (n_u, n_l) plane for varying SNR (b), varying λ (c) and varying η (d). Non-varying parameters are λ = 5, SNR = 9 and η = 1. Simulation details are given in Appendix D.

**Figure 4.** Optimal representation in autoencoders and pretrained LLMs. Optimal representation size as a function of unlabelled (n_u) and labelled (n_l) sample sizes. Linear autoencoders are considered in panels (a,c) and nonlinear ones in panels (b,d). Autoencoders are trained on Gaussian data with either spiked identity covariance Σ = I_p + λvv^⊤ (panels (a,b)) or a spiked Toeplitz covariance Σ = H + λvv^⊤, where H_{i,j} = …

**Figure 5.** Generalisation error of PCR as a function of the number of retained components.

**Figure 6.** Components of the theoretical estimation error as a function of …

**Figure 7.** Components of the theoretical generalisation error as a function of …

**Figure 8.** Optimal value of α that minimizes the generalisation error, shown as a function of SNR and spike alignment η. We fix λ = 2. Red lines indicate the approximate phase transitions.

**Figure 9.** Optimal value of α that minimizes the training error, shown as a function of SNR and spike alignment η. We fix λ = 2.

**Figure 10.** Optimal value of α that minimizes the generalisation error, shown as a function of spike strength λ and spike alignment θ with w⋆. We fix SNR = 9. Red lines indicate the approximate phase transitions.

**Figure 11.** Optimal value of α that minimizes the training error, shown as a function of spike strength λ and spike alignment θ with w⋆. We fix SNR = 9.

[Plot labels recovered from the image preceding the Figure 12 caption: panels "Task-aligned" and "Task-misaligned"; axes unlabelled n_u and labelled n_l, each from 1 to 3000; legend PR, PCR, Regression, best model within a 2% tie threshold.]

**Figure 12.** Heatmaps over the (n_u, n_l) grid (p = 500, SNR = 1.8, λ = 5) showing which method, standard regression (violet), pretrained regression (PR, blue) or principal component regression (PCR, orange), achieves lower generalisation error. Forward hatching (//) marks regions where PR and PCR are tied (within 2% error); cross hatching (×) marks regions where all three methods (PR, PCR, and standard regression) ar…

**Figure 13.** Extended version of panels (a-d) in …

**Figure 14.** Effect of SNR on optimal bottleneck size in autoencoders on inputs with different covariance structure. We …

**Figure 15.** Spike direction recovery as a function of sample ratio …

**Figure 16.** Optimal m using PCA on representations of the downstream task. We use the same layout as …

**Figure 17.** Validation accuracy vs number of PCA components …

**Figure 18.** Eigenvalue spectra of Pythia-70M-deduped last-token last-hidden-layer representations at five pretraining checkpoints. Each panel shows the empirical distribution of covariance eigenvalues computed from the representation of the sst2 task. Throughout pretraining, the spectrum shows a clear bulk and a few outlying eigenvalues. As pretraining progresses (steps 128 to 143,000), the outlying eigenvalues becom…

**Figure 19.** Validation accuracy vs pretraining checkpoint for three probing conditions across five downstream task…
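The "bulk plus a few outlying eigenvalues" picture in the Figure 18 caption can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the paper's analysis: it uses a spiked identity covariance with unit noise variance and counts eigenvalues above the Marchenko-Pastur bulk edge (1 + sqrt(p/n))^2, inflated by a small slack. For real model representations the noise scale is unknown and would have to be estimated. Function names are hypothetical.

```python
import numpy as np

def covariance_spectrum(x):
    """Eigenvalues (descending) of the sample covariance of the centred rows of x."""
    centred = x - x.mean(axis=0, keepdims=True)
    cov = centred.T @ centred / x.shape[0]
    return np.linalg.eigvalsh(cov)[::-1]

def count_outliers(eigvals, n, p, noise_var=1.0, slack=0.1):
    """Count eigenvalues above the Marchenko-Pastur bulk edge, inflated by `slack`."""
    edge = noise_var * (1.0 + np.sqrt(p / n)) ** 2
    return int(np.sum(eigvals > (1.0 + slack) * edge))

# Assumed synthetic setting: one spike of strength lam on top of identity covariance.
rng = np.random.default_rng(0)
n, p, lam = 2000, 100, 10.0
v = np.zeros(p)
v[0] = 1.0
x = rng.standard_normal((n, p)) + np.sqrt(lam) * rng.standard_normal((n, 1)) * v

eigvals = covariance_spectrum(x)
n_outliers = count_outliers(eigvals, n, p)
print("largest eigenvalue:", eigvals[0], "outliers above bulk edge:", n_outliers)
```

The `slack` parameter is a design choice that guards against finite-sample fluctuations of the largest bulk eigenvalue; a tighter threshold would need a more careful edge estimate.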

Original abstract

Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error showcasing their dependence on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence downstream generalisation, and we characterize the optimal representation size as a function of task parameters: with abundant pretraining data but scarce downstream data, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Furthermore, we establish an exact trade-off between pretraining and supervision, quantifying how much unlabelled data is required to replace a single labelled sample. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, we highlight that optimising representation size is critical, giving conditions for when compression during pretraining improves generalisation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper derives exact closed-form expressions for optimal representation size in a high-dimensional PCA pretraining plus linear probing model, with clear regimes based on unlabeled versus labeled data amounts.

Here's the quick take: this paper works out exact expressions for the optimal representation size in a high-dimensional pretraining and linear probing setup, showing clear regimes where compression helps or hurts based on the amounts of unlabeled versus labeled data. They model structure extraction via PCA on the unlabeled set and then do linear regression on the labeled set for the downstream task. In the high-dimensional limit they get closed forms for the errors in terms of the representation dimension k, the sample sizes, and an alignment parameter. Minimizing the generalization error with respect to k gives the optimal size, and they also extract a precise trade-off quantifying the unlabeled data needed to stand in for labeled data.

The derivations look clean and they back the phenomenology with observations on autoencoders and actual pretrained LLMs, which is a nice touch. This adds some credibility that the qualitative behavior isn't just an artifact of the linear model. The main limitation is that the whole thing is built on PCA and linear regression, which is a deliberate simplification. Real pretraining involves nonlinear networks and more involved objectives, so translating the exact numbers to practice will take work. The LLM checks are more qualitative, so they don't fully bridge the gap. Still, I see no obvious holes in the math or hidden assumptions that undermine the claims.

This is the kind of paper that would interest people doing theoretical work on why pretraining works or how to choose representation dimensions efficiently. If you're thinking about data-efficient learning or high-dimensional stats in ML, it's worth a look. I'd recommend putting it through peer review. The results are novel enough and the analysis is sharp, even with the idealized setting.

Referee Report

1 major / 2 minor

Summary. The manuscript develops an analytical high-dimensional model of the pretraining-plus-linear-probing pipeline. Structure extraction is formalized as PCA on unlabeled data of size n_u, and downstream learning as linear regression on a separate labeled set of size n_l. Exact closed-form expressions are derived for training and generalization error as functions of representation dimension k, n_u, n_l, and a task-alignment parameter. The optimal k is obtained by minimizing the generalization error, which yields two regimes: maximal compression is optimal when pretraining data is abundant and downstream labels are scarce, while higher-dimensional representations are preferable when pretraining data is limited. An exact quantitative trade-off is also given: the amount of unlabeled data needed to replace one labeled sample. The same qualitative phenomenology is reported for autoencoders and pretrained LLMs.

Significance. If the derivations hold, the paper supplies a precise, parameter-dependent characterization of when and why compression during pretraining improves downstream generalization, together with an explicit data-efficiency trade-off. The closed-form results and the consistency with observations on autoencoders and LLMs provide both theoretical insight and practical guidance for representation-size selection in modern pipelines.

major comments (1)

§3 (high-dimensional analysis): the exact expressions for generalization error are stated to follow from the PCA-plus-linear-regression model, but the precise assumptions on the data-generating process (eigenvalue decay, alignment of the task vector with the principal components) are not restated in the main text; without them the claimed parameter-free character of the optimal-k formula cannot be verified directly from the provided derivations.

minor comments (2)

Figure 4: the legend for the LLM curves does not indicate the number of runs or error bars; adding this information would clarify the strength of the reported agreement with the theoretical curves.
Notation: the task-alignment parameter is introduced in Eq. (3) but subsequently referred to by several different symbols in the text; a single consistent symbol would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below.

Referee: §3 (high-dimensional analysis): the exact expressions for generalization error are stated to follow from the PCA-plus-linear-regression model, but the precise assumptions on the data-generating process (eigenvalue decay, alignment of the task vector with the principal components) are not restated in the main text; without them the claimed parameter-free character of the optimal-k formula cannot be verified directly from the provided derivations.

Authors: We agree that the main text would benefit from a concise restatement of the key modeling assumptions. In the revised manuscript we will add a short paragraph at the start of §3 that summarizes the eigenvalue decay law and the alignment of the task vector with the principal components. This change will allow readers to verify the derivations and the parameter-free character of the optimal-k formula without consulting the appendix, while leaving the technical results unchanged. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper constructs an explicit analytical model with PCA on unlabelled data for structure extraction and linear regression on a separate labelled dataset for downstream learning. In the high-dimensional regime it derives closed-form expressions for training and generalisation error that depend on representation dimension k, unlabelled size n_u, labelled size n_l, and task-alignment parameter. The optimal k is obtained by minimising the generalisation error with respect to these quantities, directly producing the stated regimes without reducing to any fitted parameter from the target data or to a self-citation chain. Similar phenomenology reported for autoencoders and pretrained LLMs supplies independent external support. The derivation is therefore self-contained against the model's stated assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on modeling pretraining exactly as PCA and probing as linear regression, plus the high-dimensional asymptotic regime that enables closed-form solutions.

free parameters (1)

task alignment parameter
The error expressions depend on task alignment with the principal components; this quantity is treated as an input parameter of the model.

axioms (2)

domain assumption: High-dimensional regime in which the number of features and the sample sizes are both large.
Invoked to obtain exact closed-form expressions for training and generalization error.
domain assumption: Structure extraction exactly equals principal component analysis on unlabeled data.
Stated explicitly as the formalization of pretraining.

pith-pipeline@v0.9.0 · 5780 in / 1478 out tokens · 50648 ms · 2026-05-20T06:49:05.537835+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "exact expressions for training and generalisation error ... optimal representation size as a function of task parameters"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 6 internal anchors

[1]

GPT-4 Technical Report

OpenAI et al. GPT-4 Technical Report.arXiv preprint arXiv:2303.08774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

GLUE: A multi-task benchmark and analysis platform for natural language understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018

work page 2018
[4]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021

work page 2021
[5]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InAdvances in Neural Information Processing Systems, volume 30, 2017

work page 2017
[6]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

work page 2022
[7]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Represen- tations, 2022

work page 2022
[8]

Test-time training with self-supervision for generalization under distribution shifts

Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. InInternational conference on machine learning. PMLR, 2020

work page 2020
[9]

Ttt++: When does self-supervised test-time training fail or thrive? InAdvances in Neural Information Processing Systems, volume 34, 2021

Yuejiang Liu, Parth Kothari, Bastien Van Delft, Baptiste Bellot-Gurlet, Taylor Mordan, and Alexandre Alahi. Ttt++: When does self-supervised test-time training fail or thrive? InAdvances in Neural Information Processing Systems, volume 34, 2021

work page 2021
[10]

The surprising effectiveness of test-time training for few-shot learning

Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. The surprising effectiveness of test-time training for few-shot learning. InInternational Conference on Machine Learning. PMLR, 2025

work page 2025
[11]

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021

work page 2021
[12]

Universal language model fine-tuning for text classification

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018

work page 2018
[13]

Greedy layer-wise training of deep networks

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. InAdvances in Neural Information Processing Systems, volume 19, 2006

work page 2006
[14]

Why does unsupervised pre-training help deep learning?Journal of Machine Learning Research, 11, 2010

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning?Journal of Machine Learning Research, 11, 2010

work page 2010
[15]

Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 2010

work page 2010
[16]

Representation learning: A review and new perspectives

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 2013

work page 2013
[17]

The representation of object concepts in the brain.Annu

Alex Martin. The representation of object concepts in the brain.Annu. Rev. Psychol., 58, 2007

work page 2007
[18]

Why neurons mix: high dimensionality for higher cognition

Stefano Fusi, Earl K Miller, and Mattia Rigotti. Why neurons mix: high dimensionality for higher cognition. Current Opinion in Neurobiology, 37, 2016. 10 Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear ProbingA PREPRINT

work page 2016
[19]

Continual lifelong learning with neural networks: A review.Neural Networks, 113, 2019

German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review.Neural Networks, 113, 2019

work page 2019
[20]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114, 2017

work page 2017
[21]

Continual learning through synaptic intelligence

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning. PMLR, 2017

work page 2017
[22]

Exact learning dynamics of deep linear networks with prior knowledge

Lukas Braun, Clémentine Carla Juliette Dominé, James E Fitzgerald, and Andrew M Saxe. Exact learning dynamics of deep linear networks with prior knowledge. InAdvances in Neural Information Processing Systems, 2022

work page 2022
[23]

The dimensionality of neural represen- tations for control.Current Opinion in Behavioral Sciences, 38, 2021

David Badre, Apoorva Bhandari, Haley Keglovits, and Atsushi Kikumoto. The dimensionality of neural represen- tations for control.Current Opinion in Behavioral Sciences, 38, 2021

work page 2021
[24]

Boyle, Lorenzo Posani, Sarah Irfan, Steven A

Lara M. Boyle, Lorenzo Posani, Sarah Irfan, Steven A. Siegelbaum, and Stefano Fusi. Tuned geometries of hippocampal representations meet the computational demands of social memory.Neuron, 112, 2024

work page 2024
[25]

Courellis, Juri Minxha, Araceli R

Hristos S. Courellis, Juri Minxha, Araceli R. Cardenas, Daniel L. Kimmel, Chrystal M. Reed, Taufik A. Valiante, C. Daniel Salzman, Adam N. Mamelak, Stefano Fusi, and Ueli Rutishauser. Abstract representations emerge in human hippocampal neurons during inference.Nature, 632, 2024

work page 2024
[26]

Karyna Mishchanchuk, Gabrielle Gregoriou, Albert Qü, Alizée Kastler, Quentin J. M. Huys, Linda Wilbrecht, and Andrew F. MacAskill. Hidden state inference requires abstract contextual representations in the ventral hippocampus.Science, 386, 2024

work page 2024
[27]

Rodgers, Randy M

Ramon Nogueira, Chris C. Rodgers, Randy M. Bruno, and Stefano Fusi. The geometry of cortical representations of touch in rodents.Nature Neuroscience, 26, 2023

work page 2023
[28]

The geometry of hidden representations of large transformer models

Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. The geometry of hidden representations of large transformer models. InAdvances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[29]

High-dimensional asymptotics of prediction: Ridge regression and classifica- tion.The Annals of Statistics, 2018

Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classifica- tion.The Annals of Statistics, 2018

work page 2018
[30]

Tibshirani

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation.The Annals of Statistics, 50, 2022

work page 2022
[31]

Asymptotics of ridge(less) regression under general source condition

Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge(less) regression under general source condition. InInternational Conference on Artificial Intelligence and Statistics, 2020

work page 2020
[32]

On the optimal weighted \ell_2 regularization in overparameterized linear regression

Denny Wu and Ji Xu. On the optimal weighted \ell_2 regularization in overparameterized linear regression. In Advances in Neural Information Processing Systems, volume 33, 2020

work page 2020
[33]

Bartlett, Philip M

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117, 2020

work page 2020
[34]

Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116, 2019

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116, 2019

work page 2019
[35]

Advani, Andrew M

Madhu S. Advani, Andrew M. Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks.Neural Networks, 132, 2020

work page 2020
[36]

The distribution of ridgeless least squares interpolators.Journal of Machine Learning Research, 27, 2026

Qiyang Han and Xiaocong Xu. The distribution of ridgeless least squares interpolators.Journal of Machine Learning Research, 27, 2026

work page 2026
[37]

Spurious correlations in high dimensional regression: The roles of regularization, simplicity bias and over-parameterization

Simone Bombari and Marco Mondelli. Spurious correlations in high dimensional regression: The roles of regularization, simplicity bias and over-parameterization. InInternational Conference on Machine Learning, 2025

[38] Edwige Cyffers, Alireza Mirrokni, and Marco Mondelli. Optimal regularization for performative learning. arXiv preprint arXiv:2510.12249, 2025

[39] Parham Rezaei, Filip Kovacevic, Francesco Locatello, and Marco Mondelli. High-dimensional analysis of synthetic data selection. In International Conference on Learning Representations, 2026

[40] M. Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, and Samet Oymak. High-dimensional analysis of knowledge distillation: Weak-to-strong generalization and scaling laws. In International Conference on Learning Representations, 2025

[41] Germain Kolossov, Andrea Montanari, and Pulkit Tandon. Towards a statistical theory of data selection under weak supervision. In International Conference on Learning Representations, 2024

[42] Ayush Jain, Andrea Montanari, and Eren Sasoglu. Scaling laws for learning with real and surrogate data. Advances in Neural Information Processing Systems, 37, 2024

[43] Ian T. Jolliffe. A Note on the Use of Principal Components in Regression. Applied Statistics, 31, 1982

[44] Ji Xu and Daniel J Hsu. On the number of variables to use in principal component regression. In Advances in Neural Information Processing Systems, volume 32, 2019

[45] Alden Green and Elad Romanov. The high-dimensional asymptotics of principal component regression. The Annals of Statistics, 53, 2025

[46] William F. Massy. Principal components regression in exploratory statistical research. Journal of the American Statistical Association, 60, 1965

[47] Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. A risk comparison of ordinary least squares vs ridge regression. Journal of Machine Learning Research, 14, 2013

[48] Jianfeng Yao, Zhidong Bai, and Shui-Rong Zheng. Large sample covariance matrices and high-dimensional data analysis. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2015

[49] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22, 2010

[50] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, PP, 2020

[51] Federica Gerace, Luca Saglietti, Stefano Sarao Mannelli, Andrew Saxe, and Lenka Zdeborová. Probing transfer learning with a model of synthetic correlated datasets. Machine Learning: Science and Technology, 2022

[52] Andrew K. Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. In International Conference on Learning Representations, 2019

[53] Javan Tahir, Surya Ganguli, and Grant M. Rotskoff. Features are fate: a theory of transfer learning in high-dimensional regression. In Forty-second International Conference on Machine Learning, 2025

[54] Fan Yang, Hongyang R Zhang, Sen Wu, Christopher Re, and Weijie J Su. Precise high-dimensional asymptotics for quantifying heterogeneous transfers. Journal of Machine Learning Research, 26, 2025

[55] Yanke Song, Sohom Bhattacharya, and Pragya Sur. Generalization error of min-norm interpolators in transfer learning. arXiv preprint arXiv:2406.13944, 2024

[56] Jason D. Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021

[57] Eyar Azar and Boaz Nadler. Semi-supervised sparse Gaussian classification: Provable benefits of unlabeled data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[58] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019

[59] Tianyi Zhang and Tatsunori B. Hashimoto. On the inductive bias of masked language modeling: From statistical to syntactic dependencies. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

[60] Taj Jones-Mccormick, Aukosh Jagannath, and Subhabrata Sen. Provable benefits of unsupervised pre-training and transfer learning via single-index models. In Proceedings of the 42nd International Conference on Machine Learning, volume 267. PMLR, 2025

[61] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2018

[62] Bin Wang, W. Jeffrey Johnston, and Stefano Fusi. A mathematical theory for understanding when abstract representations emerge in neural networks. arXiv preprint arXiv:2510.09816, 2026

[63] Nicolas Anguita, Francesco Locatello, Andrew M. Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, and Clementine Domine. A theory of how pretraining shapes inductive bias in fine-tuning. arXiv preprint arXiv:2602.20062, 2026

[64] Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2), 2001

[65] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33, 2005

[66] V. A. Marčenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1, 1967

[67] Zhidong Bai and Jack W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer Series in Statistics. Springer New York, 2010

[68] Jinho Baik and Jack W Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97, 2006

[69] Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 2007

[70] Zhidong Bai and Jianfeng Yao. On sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis, 106, 2012

[71] Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, and Denny Wu. Learning in the presence of low-dimensional structure: a spiked random matrix perspective. In Advances in Neural Information Processing Systems, volume 36, 2023

[72] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? In Advances in Neural Information Processing Systems, volume 33, 2020

[73] Alireza Mousavi-Hosseini, Denny Wu, Taiji Suzuki, and Murat A Erdogdu. Gradient-based feature learning under structured data. In Advances in Neural Information Processing Systems, volume 36, 2023

[74] Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Advances in Neural Information Processing Systems, volume 35, 2022

[75] Hugo Cui, Luca Pesce, Yatin Dandi, Florent Krzakala, Yue M Lu, Lenka Zdeborová, and Bruno Loureiro. Asymptotics of feature learning in two-layer networks after one gradient-step. In International Conference on Machine Learning, 2024

[76] Yatin Dandi, Luca Pesce, Hugo Cui, Florent Krzakala, Yue Lu, and Bruno Loureiro. A random matrix theory perspective on the spectrum of learned features and asymptotic generalization capabilities. In International Conference on Artificial Intelligence and Statistics, pages 2224–2232. PMLR, 2025

[77] Rishi Sonthalia, Michael Murray, and Guido F. Montúfar. Low rank gradients and where to find them. In Advances in Neural Information Processing Systems, 2025

[78] Daniel Gedon, Antônio H. Ribeiro, and Thomas B. Schön. No double descent in principal component regression: A high-dimensional analysis. In International Conference on Machine Learning, 2024

[79] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023

[80] Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the Pile. arXiv preprint arXiv:2201.07311, 2022


[1] OpenAI et al. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2024

[2] Aaron Grattafiori et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024

[3] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018

[4] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021

[5] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30, 2017

[6] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

[7] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

[8] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning. PMLR, 2020

[9] Yuejiang Liu, Parth Kothari, Bastien Van Delft, Baptiste Bellot-Gurlet, Taylor Mordan, and Alexandre Alahi. TTT++: When does self-supervised test-time training fail or thrive? In Advances in Neural Information Processing Systems, volume 34, 2021

[10] Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. The surprising effectiveness of test-time training for few-shot learning. In International Conference on Machine Learning. PMLR, 2025

[11] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021

[12] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018

[13] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, volume 19, 2006

[14] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, 2010

[15] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 2010

[16] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 2013

[17] Alex Martin. The representation of object concepts in the brain. Annu. Rev. Psychol., 58, 2007

[18] Stefano Fusi, Earl K Miller, and Mattia Rigotti. Why neurons mix: high dimensionality for higher cognition. Current Opinion in Neurobiology, 37, 2016

[19] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113, 2019

[20] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114, 2017

[21] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning. PMLR, 2017

[22] Lukas Braun, Clémentine Carla Juliette Dominé, James E Fitzgerald, and Andrew M Saxe. Exact learning dynamics of deep linear networks with prior knowledge. In Advances in Neural Information Processing Systems, 2022

[23] David Badre, Apoorva Bhandari, Haley Keglovits, and Atsushi Kikumoto. The dimensionality of neural representations for control. Current Opinion in Behavioral Sciences, 38, 2021

[24] Lara M. Boyle, Lorenzo Posani, Sarah Irfan, Steven A. Siegelbaum, and Stefano Fusi. Tuned geometries of hippocampal representations meet the computational demands of social memory. Neuron, 112, 2024

[25] Hristos S. Courellis, Juri Minxha, Araceli R. Cardenas, Daniel L. Kimmel, Chrystal M. Reed, Taufik A. Valiante, C. Daniel Salzman, Adam N. Mamelak, Stefano Fusi, and Ueli Rutishauser. Abstract representations emerge in human hippocampal neurons during inference. Nature, 632, 2024

[26] Karyna Mishchanchuk, Gabrielle Gregoriou, Albert Qü, Alizée Kastler, Quentin J. M. Huys, Linda Wilbrecht, and Andrew F. MacAskill. Hidden state inference requires abstract contextual representations in the ventral hippocampus. Science, 386, 2024

[27] Ramon Nogueira, Chris C. Rodgers, Randy M. Bruno, and Stefano Fusi. The geometry of cortical representations of touch in rodents. Nature Neuroscience, 26, 2023

[28] Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. The geometry of hidden representations of large transformer models. In Advances in Neural Information Processing Systems, volume 36, 2023

[29] Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 2018

[30] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50, 2022

[31] [31]

Asymptotics of ridge(less) regression under general source condition

Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge(less) regression under general source condition. InInternational Conference on Artificial Intelligence and Statistics, 2020

work page 2020

[32] [32]

On the optimal weighted \ell_2 regularization in overparameterized linear regression

Denny Wu and Ji Xu. On the optimal weighted \ell_2 regularization in overparameterized linear regression. In Advances in Neural Information Processing Systems, volume 33, 2020

work page 2020

[33] [33]

Bartlett, Philip M

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117, 2020

work page 2020

[34] [34]

Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116, 2019

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116, 2019

work page 2019

[35] [35]

Advani, Andrew M

Madhu S. Advani, Andrew M. Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks.Neural Networks, 132, 2020

work page 2020

[36] [36]

The distribution of ridgeless least squares interpolators.Journal of Machine Learning Research, 27, 2026

Qiyang Han and Xiaocong Xu. The distribution of ridgeless least squares interpolators.Journal of Machine Learning Research, 27, 2026

work page 2026

[37] [37]

Spurious correlations in high dimensional regression: The roles of regularization, simplicity bias and over-parameterization

Simone Bombari and Marco Mondelli. Spurious correlations in high dimensional regression: The roles of regularization, simplicity bias and over-parameterization. InInternational Conference on Machine Learning, 2025

work page 2025

[38] [38]

Optimal regularization for performative learning.arXiv preprint arXiv:2510.12249, 2025

Edwige Cyffers, Alireza Mirrokni, and Marco Mondelli. Optimal regularization for performative learning.arXiv preprint arXiv:2510.12249, 2025. 11 Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear ProbingA PREPRINT

work page arXiv 2025

[39] [39]

High-dimensional analysis of synthetic data selection

Parham Rezaei, Filip Kovacevic, Francesco Locatello, and Marco Mondelli. High-dimensional analysis of synthetic data selection. InInternational Conference on Learning Representations, 2026

work page 2026

[40] [40]

Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, and Samet Oymak

M. Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, and Samet Oymak. High-dimensional analysis of knowledge distillation: Weak-to-strong generalization and scaling laws. InInternational Conference on Learning Representations, 2025

work page 2025

[41] [41]

Towards a statistical theory of data selection under weak supervision

Germain Kolossov, Andrea Montanari, and Pulkit Tandon. Towards a statistical theory of data selection under weak supervision. InInternational Conference on Learning Representations, 2024

work page 2024

[42] [42]

Scaling laws for learning with real and surrogate data.Advances in Neural Information Processing Systems, 37, 2024

Ayush Jain, Andrea Montanari, and Eren Sasoglu. Scaling laws for learning with real and surrogate data.Advances in Neural Information Processing Systems, 37, 2024

work page 2024

[43] [43]

Jolliffe

Ian T. Jolliffe. A Note on the Use of Principal Components in Regression.Applied Statistics, 31, 1982

work page 1982

[44] [44]

On the number of variables to use in principal component regression

Ji Xu and Daniel J Hsu. On the number of variables to use in principal component regression. InAdvances in Neural Information Processing Systems, volume 32, 2019

work page 2019

[45] [45]

The high-dimensional asymptotics of principal component regression.The Annals of Statistics, 53, 2025

Alden Green and Elad Romanov. The high-dimensional asymptotics of principal component regression.The Annals of Statistics, 53, 2025

work page 2025

[46] [46]

William F. Massy. Principal components regression in exploratory statistical research.Journal of the American Statistical Association, 60, 1965

work page 1965

[47] [47]

Dhillon, Dean P

Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. A risk comparison of ordinary least squares vs ridge regression.Journal of Machine Learning Research, 14, 2013

work page 2013

[48] [48]

Cambridge series in statistical and probabilistic mathematics

Jianfeng Yao, Zhidong Bai, and Shui-Rong Zheng.Large sample covariance matrices and high-dimensional data analysis. Cambridge series in statistical and probabilistic mathematics. Cambridge university press, 2015

work page 2015

[49] [49]

A survey on transfer learning.IEEE Transactions on Knowledge and Data Engineering, 22, 2010

Sinno Jialin Pan and Qiang Yang. A survey on transfer learning.IEEE Transactions on Knowledge and Data Engineering, 22, 2010

work page 2010

[50] [50]

A comprehensive survey on transfer learning.Proceedings of the IEEE, PP, 2020

Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning.Proceedings of the IEEE, PP, 2020

work page 2020

[51] [51]

Probing transfer learning with a model of synthetic correlated datasets.Machine Learning: Science and Technology, 2022

Federica Gerace, Luca Saglietti, Stefano Sarao Mannelli, Andrew Saxe, and Lenka Zdeborová. Probing transfer learning with a model of synthetic correlated datasets.Machine Learning: Science and Technology, 2022

work page 2022

[52] [52]

Lampinen and Surya Ganguli

Andrew K. Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. InInternational Conference on Learning Representations, 2019

work page 2019

[53] [53]

Rotskoff

Javan Tahir, Surya Ganguli, and Grant M. Rotskoff. Features are fate: a theory of transfer learning in high- dimensional regression. InForty-second International Conference on Machine Learning, 2025

work page 2025

[54] [54]

Precise high-dimensional asymptotics for quantifying heterogeneous transfers.Journal of Machine Learning Research, 26, 2025

Fan Yang, Hongyang R Zhang, Sen Wu, Christopher Re, and Weijie J Su. Precise high-dimensional asymptotics for quantifying heterogeneous transfers.Journal of Machine Learning Research, 26, 2025

work page 2025

[55] [55]

Generalization error of min-norm interpolators in transfer learning.arXiv preprint arXiv:2406.13944, 2024

Yanke Song, Sohom Bhattacharya, and Pragya Sur. Generalization error of min-norm interpolators in transfer learning.arXiv preprint arXiv:2406.13944, 2024

work page arXiv 2024

[56] [56]

Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo

Jason D. Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, 2021

work page 2021

[57] [57]

Semi-supervised sparse gaussian classification: Provable benefits of unlabeled data

Eyar Azar and Boaz Nadler. Semi-supervised sparse gaussian classification: Provable benefits of unlabeled data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024

[58] [58]

A Theoretical Analysis of Contrastive Unsupervised Representation Learning

Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning.arXiv preprint arXiv:1902.09229, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902

[59] [59]

Hashimoto

Tianyi Zhang and Tatsunori B. Hashimoto. On the inductive bias of masked language modeling: From statistical to syntactic dependencies. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021. 12 Optimal Representation Size: High-Dimensional Analysis of Pretrain...

work page 2021

[60] [60]

Provable benefits of unsupervised pre-training and transfer learning via single-index models

Taj Jones-Mccormick, Aukosh Jagannath, and Subhabrata Sen. Provable benefits of unsupervised pre-training and transfer learning via single-index models. InProceedings of the 42nd International Conference on Machine Learning, volume 267. PMLR, 2025

work page 2025

[61] [61]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[62] [62]

Jeffrey Johnston, and Stefano Fusi

Bin Wang, W. Jeffrey Johnston, and Stefano Fusi. A mathematical theory for understanding when abstract representations emerge in neural networks.arXiv preprint arXiv:2510.09816, 2026

work page arXiv 2026

[63] [63]

Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, and Clementine Domine

Nicolas Anguita, Francesco Locatello, Andrew M. Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, and Clementine Domine. A theory of how pretraining shapes inductive bias in fine-tuning.arXiv preprint arXiv:2602.20062, 2026

work page arXiv 2026

[64] [64]

Johnstone

Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis.The Annals of Statistics, 29(2), 2001

work page 2001

[65] [65]

Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices.The Annals of Probability, 33, 2005

Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices.The Annals of Probability, 33, 2005

work page 2005

[66] [66]

V . A. Marˇcenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices.Mathematics of the USSR-Sbornik, 1, 1967

work page 1967

[67] Zhidong Bai and Jack W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer Series in Statistics. Springer New York, 2010

[68] Jinho Baik and Jack W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97, 2006

[69] Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 2007

[70] Zhidong Bai and Jianfeng Yao. On sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis, 106, 2012

[71] Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, and Denny Wu. Learning in the presence of low-dimensional structure: a spiked random matrix perspective. In Advances in Neural Information Processing Systems, volume 36, 2023

[72] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? In Advances in Neural Information Processing Systems, volume 33, 2020

[73] Alireza Mousavi-Hosseini, Denny Wu, Taiji Suzuki, and Murat A Erdogdu. Gradient-based feature learning under structured data. In Advances in Neural Information Processing Systems, volume 36, 2023

[74] Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Advances in Neural Information Processing Systems, volume 35, 2022

[75] Hugo Cui, Luca Pesce, Yatin Dandi, Florent Krzakala, Yue M Lu, Lenka Zdeborová, and Bruno Loureiro. Asymptotics of feature learning in two-layer networks after one gradient-step. In International Conference on Machine Learning, 2024

[76] Yatin Dandi, Luca Pesce, Hugo Cui, Florent Krzakala, Yue Lu, and Bruno Loureiro. A random matrix theory perspective on the spectrum of learned features and asymptotic generalization capabilities. In International Conference on Artificial Intelligence and Statistics, pages 2224–2232. PMLR, 2025

[77] Rishi Sonthalia, Michael Murray, and Guido F. Montúfar. Low rank gradients and where to find them. In Advances in Neural Information Processing Systems, 2025

[78] Daniel Gedon, Antônio H. Ribeiro, and Thomas B. Schön. No double descent in principal component regression: A high-dimensional analysis. In International Conference on Machine Learning, 2024

[79] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023

[80] Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the pile. arXiv preprint arXiv:2201.07311, 2022