What learning algorithm is in-context learning? Investigations with linear models
Pith reviewed 2026-05-17 13:34 UTC · model grok-4.3
The pith
Transformers implement gradient descent and ridge regression implicitly when doing in-context learning on linear tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression, transitioning between different predictors as transformer depth and dataset noise vary, and converging to Bayesian estimators for large widths and depths.
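For reference, the predictors named in the core claim have standard closed forms. The notation below is assumed for illustration and is not taken from the paper: $X$ stacks the context inputs as rows, $y$ stacks their labels, $\lambda$ is a ridge penalty, $\alpha$ is a step size, and the last line assumes a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$ with Gaussian label noise of variance $\sigma^2$.

```latex
\begin{align}
\hat{w}_{\mathrm{OLS}} &= (X^\top X)^{-1} X^\top y
  && \text{(exact least squares, } X^\top X \text{ invertible)} \\
\hat{w}_{\mathrm{ridge}}(\lambda) &= (X^\top X + \lambda I)^{-1} X^\top y
  && \text{(ridge regression)} \\
w_{t+1} &= w_t - \alpha\, X^\top (X w_t - y)
  && \text{(gradient descent on } \tfrac{1}{2}\lVert X w - y \rVert^2 \text{)} \\
\mathbb{E}[w \mid X, y] &= \Bigl(X^\top X + \tfrac{\sigma^2}{\tau^2} I\Bigr)^{-1} X^\top y
  && \text{(posterior mean under the Gaussian assumptions)}
\end{align}
```

Under these assumptions the posterior mean coincides with ridge regression at $\lambda = \sigma^2 / \tau^2$, and it approaches exact least squares as the noise variance goes to zero. This is one way to read the reported transitions between predictors as noise varies.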
What carries the argument
Implicit linear models encoded in transformer activations that are updated by new labeled examples appearing in the context.
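One way to picture an implicit model that is updated as labeled examples arrive is a recursive ridge estimator. The sketch below is an illustration under assumptions, not the paper's transformer construction: the state is an inverse regularized moment matrix plus the vector $X^\top y$, and each new example is folded in with a Sherman-Morrison rank-one update. The function names (`init_state`, `update_state`, `weights`) and the toy data are hypothetical.

```python
import numpy as np

def init_state(dim, lam):
    """Empty-context state: (lam * I)^{-1} and a zero X^T y vector."""
    return np.eye(dim) / lam, np.zeros(dim)

def update_state(A_inv, b, x, y):
    """Fold one labeled example (x, y) into the state.

    Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x),
    valid here because A^{-1} is symmetric.
    """
    Ax = A_inv @ x
    A_inv = A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
    b = b + y * x
    return A_inv, b

def weights(A_inv, b):
    """Ridge weight vector implied by the current state."""
    return A_inv @ b

# Toy linear task (all sizes and noise levels are assumptions for the demo).
rng = np.random.default_rng(0)
dim, n, lam = 4, 10, 0.1
w_true = rng.normal(size=dim)
X = rng.normal(size=(n, dim))
y = X @ w_true + 0.1 * rng.normal(size=n)

# Process the context one example at a time.
A_inv, b = init_state(dim, lam)
for x_i, y_i in zip(X, y):
    A_inv, b = update_state(A_inv, b, x_i, y_i)

w_incremental = weights(A_inv, b)
# Batch closed-form ridge solution for comparison.
w_batch = np.linalg.solve(X.T @ X + lam * np.eye(dim), X.T @ y)
```

The incremental and batch weight vectors should agree up to floating-point error, which shows that a fixed-size state can carry everything needed to keep the estimate current as the context grows.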
If this is right
- Transformers can be explicitly constructed to run gradient descent or closed-form ridge regression on linear models.
- Trained in-context models reproduce the outputs of gradient descent, ridge regression, and exact least squares on held-out points.
- The effective algorithm changes with network depth and with the noise level in the training examples.
- Late layers of trained transformers non-linearly encode weight vectors and moment matrices.
- Very wide and deep models converge to Bayesian posterior means rather than to point estimates.
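The reference predictors in these bullets can be computed directly on a synthetic linear task. The following is a minimal NumPy sketch, not the released code: the helper names, dimensions, noise level, step-size rule, and the choice of ridge penalty are all assumptions made for the example.

```python
import numpy as np

def ols(X, y):
    """Exact least squares (minimum-norm solution if X is rank deficient)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge(X, y, lam):
    """Closed-form ridge regression with penalty lam."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def gradient_descent(X, y, steps, lr=None):
    """Gradient descent on 0.5 * ||X w - y||^2, started from w = 0."""
    if lr is None:
        # Assumed step size: 1 / largest eigenvalue of X^T X, which keeps the iteration stable.
        lr = 1.0 / np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - lr * (X.T @ (X @ w - y))
    return w

# Synthetic task: weights drawn from a standard normal, Gaussian label noise.
rng = np.random.default_rng(0)
d, n, sigma = 8, 32, 0.5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + sigma * rng.normal(size=n)
x_query = rng.normal(size=d)

preds = {
    "gd_1_step": float(x_query @ gradient_descent(X, y, steps=1)),
    "gd_5000_steps": float(x_query @ gradient_descent(X, y, steps=5000)),
    # lam = sigma^2 matches the posterior mean under a unit-variance Gaussian prior.
    "ridge": float(x_query @ ridge(X, y, lam=sigma**2)),
    "ols": float(x_query @ ols(X, y)),
}
```

In this setup, many gradient steps should land on the least-squares prediction, while a single step and the ridge solution generally differ from it. Comparing a trained learner's output on `x_query` against each entry of `preds` is the kind of matching the bullets describe.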
Where Pith is reading between the lines
- If the same algorithmic alignment appears outside the linear setting, in-context learning may amount to rediscovery of classical estimators rather than invention of new ones.
- Prompt design could be guided by choosing examples that steer an implicit ridge or least-squares procedure toward a desired bias-variance trade-off.
- Measuring whether late-layer activations continue to track weight vectors on non-linear tasks would test the scope of the linear-model analogy.
Load-bearing premise
Results obtained on linear regression will carry over to the non-linear tasks that dominate real language-model in-context learning.
What would settle it
A controlled experiment in which trained transformers produce predictions on linear tasks that deviate consistently from every standard regression algorithm even after training converges.
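Such an experiment needs a measure of how far a learner's predictions sit from each candidate algorithm. The sketch below uses a mean squared gap between predictions over sampled tasks. The metric name, the task sampler, and `stand_in_learner` are assumptions for illustration; `stand_in_learner` is a placeholder where a trained transformer's prediction function would go.

```python
import numpy as np

def ridge_predict(X, y, x_query, lam):
    """Prediction of closed-form ridge regression at x_query."""
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return float(x_query @ w)

def squared_prediction_difference(learner, reference, tasks):
    """Mean squared gap between two predictors over (X, y, x_query) tasks."""
    gaps = [(learner(X, y, xq) - reference(X, y, xq)) ** 2 for X, y, xq in tasks]
    return float(np.mean(gaps))

def sample_task(rng, d=8, n=16, sigma=0.1):
    """Draw one noisy linear-regression task plus a held-out query point."""
    w = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w + sigma * rng.normal(size=n)
    return X, y, rng.normal(size=d)

rng = np.random.default_rng(0)
tasks = [sample_task(rng) for _ in range(200)]

def stand_in_learner(X, y, xq):
    # Hypothetical placeholder: a real study would call the trained model here.
    return ridge_predict(X, y, xq, lam=0.01)

references = {
    "ridge_lam_0.01": lambda X, y, xq: ridge_predict(X, y, xq, lam=0.01),
    "ridge_lam_10": lambda X, y, xq: ridge_predict(X, y, xq, lam=10.0),
}
gaps = {
    name: squared_prediction_difference(stand_in_learner, ref, tasks)
    for name, ref in references.items()
}
```

A learner that tracks one of the references yields a gap near zero for that reference and a larger gap for the others. The settling result described above would be a learner whose gap stays large against every reference after training converges.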
Original abstract
Neural sequence models, especially transformers, exhibit a remarkable capacity for in-context learning. They can construct new predictors from sequences of labeled examples $(x, f(x))$ presented in the input without further parameter updates. We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly, by encoding smaller models in their activations, and updating these implicit models as new examples appear in the context. Using linear regression as a prototypical problem, we offer three sources of evidence for this hypothesis. First, we prove by construction that transformers can implement learning algorithms for linear models based on gradient descent and closed-form ridge regression. Second, we show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression, transitioning between different predictors as transformer depth and dataset noise vary, and converging to Bayesian estimators for large widths and depths. Third, we present preliminary evidence that in-context learners share algorithmic features with these predictors: learners' late layers non-linearly encode weight vectors and moment matrices. These results suggest that in-context learning is understandable in algorithmic terms, and that (at least in the linear case) learners may rediscover standard estimation algorithms. Code and reference implementations are released at https://github.com/ekinakyurek/google-research/blob/master/incontext.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates the hypothesis that transformer-based in-context learners implement standard learning algorithms (gradient descent, ridge regression, least-squares) implicitly for linear regression by encoding and updating smaller models in their activations. It offers three lines of evidence: explicit constructions proving transformers can realize GD and closed-form ridge regression; experiments demonstrating that trained in-context learners produce predictors that closely match those of GD, ridge regression, and exact least-squares (with transitions across regimes of depth and noise, and convergence to Bayesian estimators at large width/depth); and preliminary evidence that late layers non-linearly encode weight vectors and moment matrices.
Significance. If the central results hold, the work supplies a concrete algorithmic account of in-context learning in the linear setting and demonstrates that trained transformers can rediscover classical estimators. Notable strengths include the parameter-free constructions, quantitative predictor matches backed by released code, and the focus on falsifiable comparisons rather than post-hoc fitting. These elements make the linear-case findings verifiable and extensible.
major comments (2)
- [§3] Constructions: the explicit constructions establish that transformers are capable of implementing GD and ridge regression, but the link to trained models rests on output matching rather than a demonstration that the learned weights realize the same internal update rules; this gap is load-bearing for the stronger claim that learners 'implement' the algorithms implicitly.
- [§4.2–4.3] Predictor matching and regime transitions: while experiments report close quantitative agreement with GD/ridge/least-squares and transitions with depth/noise, the manuscript does not provide a theoretical account of the selection mechanism; without it, the observed transitions remain descriptive and could be consistent with other implicit algorithms.
minor comments (3)
- [Notation] Notation for implicit model states and moment matrices should be introduced with a single consolidated table to reduce cross-reference burden.
- [Figures] Figure captions for the encoding plots (late-layer activations) should explicitly state the dimensionality and normalization used for the weight-vector and moment-matrix visualizations.
- [Abstract] The abstract's phrasing 'converging to Bayesian estimators' should be qualified with the precise prior and the scaling regime (width/depth) under which the convergence is observed.
Simulated Author's Rebuttal
We thank the referee for the careful review and the recommendation of minor revision. We address the two major comments below, clarifying the scope of our claims and noting where we will revise the manuscript.
Point-by-point responses
- Referee: [§3] Constructions: the explicit constructions establish that transformers are capable of implementing GD and ridge regression, but the link to trained models rests on output matching rather than a demonstration that the learned weights realize the same internal update rules; this gap is load-bearing for the stronger claim that learners 'implement' the algorithms implicitly.
Authors: We agree that the constructions show architectural capacity rather than proving that the learned weights exactly replicate the internal update rules of GD or ridge regression. Our stronger claim is supported by the combination of (i) the existence proofs, (ii) the close quantitative predictor matches across many regimes, and (iii) the preliminary representational evidence that late-layer activations encode weight vectors and moment matrices. We do not claim to have performed a full mechanistic interpretability analysis of the trained weights. We will revise the abstract, §3, and the discussion to moderate the phrasing from “implement … implicitly” to “can implement … and trained models produce equivalent predictors,” and we will add a short paragraph noting the distinction between capacity, behavioral equivalence, and internal mechanism. revision: partial
- Referee: [§4.2–4.3] Predictor matching and regime transitions: while experiments report close quantitative agreement with GD/ridge/least-squares and transitions with depth/noise, the manuscript does not provide a theoretical account of the selection mechanism; without it, the observed transitions remain descriptive and could be consistent with other implicit algorithms.
Authors: We accept this observation. The paper is primarily empirical: it documents the quantitative agreement, the systematic transitions with depth and noise, and the convergence to Bayesian estimators at large width/depth. No theoretical derivation of the selection mechanism that chooses among GD, ridge, or least-squares in different regimes is supplied. We will add a concise limitations paragraph in the discussion that acknowledges this gap and lists it as an open direction for future theoretical work. revision: yes
Circularity Check
No significant circularity; derivations rely on explicit constructions and external algorithm comparisons
Full rationale
The paper's central results consist of a proof by construction showing that transformers can implement gradient descent and closed-form ridge regression on linear models, followed by empirical matching of trained in-context learners to the predictors produced by these standard algorithms plus exact least-squares regression. All comparisons are to externally defined, well-known methods whose definitions and implementations do not depend on quantities fitted or defined inside this work. No load-bearing premise reduces to a self-citation, a fitted parameter renamed as a prediction, or a redefinition of inputs; the linear-regression setting is treated explicitly as a prototypical case rather than smuggled in as a universal claim. The derivation chain is therefore self-contained against independent benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: transformers can encode smaller models in their activations and update these implicit models as new examples appear in the context.
Forward citations
Cited by 17 Pith papers
- The Statistical Cost of Adaptation in Multi-Source Transfer Learning
  Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
- Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
  Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
- Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
  A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
- Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent
  Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
  Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
- Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations
  LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-mo...
- Meta-Harness: End-to-End Optimization of Model Harnesses
  Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
  LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
- Spectral Transformer Neural Processes
  STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.
- Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer
  Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.
- Learning to Adapt: In-Context Learning Beyond Stationarity
  Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
- Otter: A Multi-Modal Model with In-Context Instruction Tuning
  Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
- One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning
  Non-linear transformers enable cross-domain generalization in in-context RL by representing value functions from different domains with shared weights inside a shared RKHS.
- When Context Sticks: Studying Interference in In-Context Learning
  In-context learning shows persistent interference from prior examples, with more misleading linear examples degrading quadratic predictions and training curricula modulating recovery speed.
- Online In-Context Distillation for Low-Resource Vision Language Models
  Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.
- High-Dimensional Statistics: Reflections on Progress and Open Problems
  A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.