Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

Charles Kulick; Dongyang Li; Mingsong Yan; Sui Tang

arxiv: 2605.08475 · v2 · pith:CP2AQAENnew · submitted 2026-05-08 · 💻 cs.LG · cs.AI· cs.NA· math.NA· math.OC

Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

Mingsong Yan , Dongyang Li , Charles Kulick , Sui Tang This is my paper

Pith reviewed 2026-05-20 22:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.NAmath.NAmath.OC

keywords in-context learningkernel ridge regressionGaussian kernelspreconditioned Richardson iterationsoftmax attentionmechanistic interpretabilitytransformer architectureiterative solvers

0 comments

The pith

A standard single-head transformer approximates the Gaussian kernel ridge regression predictor by implementing preconditioned Richardson iteration across its layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a transformer with softmax attention can carry out in-context Gaussian kernel ridge regression by simulating a classical iterative solver during its forward pass. Under the assumption of bounded data, the authors construct a network with logarithmic depth in the target accuracy and width scaling as the square root of prompt length over accuracy. Attention layers compute the necessary normalized kernel interactions between tokens, while the MLP layers perform the local scalar arithmetic that advances each iteration step. This supplies an explicit, convergent mechanism with end-to-end error bounds for a nonlinear prediction task that prior work had not shown standard transformers can realize. The construction therefore links the architecture's basic components directly to a reliable solver for kernel-based in-context learning.

Core claim

A standard single-head transformer with softmax attention implements preconditioned Richardson iteration on the kernel linear system associated with Gaussian-kernel ridge regression. Under bounded-data assumptions the construction uses O(log(1/ε)) blocks and MLP width O(√(N/ε)) to produce an ε-accurate predictor for an N-token prompt. Softmax attention realizes the row-normalized Gaussian-kernel operator required for cross-token updates, while ReLU MLP layers approximate the intra-token arithmetic of each Richardson step. The same architecture, when trained on Gaussian-process regression tasks, yields layer-wise predictions whose error profiles align with the classical solver rather than un-

What carries the argument

Preconditioned Richardson iteration on the kernel Gram matrix, realized by the transformer's attention operator for inter-token kernel multiplication and by its MLP layers for the local preconditioned update rule.

If this is right

Arbitrary accuracy is achievable by adding a logarithmic number of layers rather than increasing width polynomially.
The functional split between attention (global kernel operator) and MLP (local arithmetic) supplies a concrete decomposition of how the transformer solves the nonlinear task.
The same mechanism extends the known linear-regression accounts of in-context learning to kernel methods with positive-definite kernels.
Trained transformers on related tasks exhibit layer-wise behavior consistent with the iterative solver rather than with direct closed-form evaluation.
Ablation of either the attention normalization or the MLP update breaks the match to the classical iteration trajectory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar layer-wise alignments could be tested on other positive-definite kernels to see whether the same iteration template generalizes.
The logarithmic depth bound suggests that explicit iterative modules inserted into transformers might achieve comparable accuracy with fewer total parameters.
If the bounded-norm assumption is relaxed, the required depth or width may grow with data scale, offering a testable prediction for performance on unbounded or high-norm inputs.
The construction provides a concrete target for probing whether naturally trained models internally approximate this or a related solver on kernel regression prompts.

Load-bearing premise

The data points remain bounded in norm so that the iteration converges at a rate independent of prompt length and the approximation errors stay controlled.

What would settle it

On a concrete bounded dataset whose kernel matrix is known exactly, the successive internal representations of the transformer do not match the successive iterates produced by running preconditioned Richardson iteration to the same tolerance.

Figures

Figures reproduced from arXiv: 2605.08475 by Charles Kulick, Dongyang Li, Mingsong Yan, Sui Tang.

**Figure 1.** Figure 1: Layer-wise transformer errors align with step-wise preconditioned Richardson errors. Spherical inputs, maximum context length N = 40. All panels share the same y-axis (MSE, log scale) over context lengths n∈ {2, 10, 15, 20, 25, 30, 35, 40}; the dashed horizontal line in every panel marks the transformer’s final-layer MSE at n = 40 (the empirical floor). The x-axes differ: transformer panel (a) covers laye… view at source ↗

**Figure 2.** Figure 2: Error similarity visualizations for spherical data. (a) SimE heatmap between transformer layer-wise outputs and preconditioned Richardson step-wise outputs; yellow boxes mark the best-matching step t ∗ (ℓ) per layer, with inset numbers denoting the peak cosine similarity. (b) Argmax trajectory t ∗ (ℓ) for four classical algorithms; only preconditioned Richardson exhibits a steady linear progression across… view at source ↗

**Figure 3.** Figure 3: Best-matching step per transformer layer for Uniform and Gaussian inputs, supple [PITH_FULL_IMAGE:figures/full_fig_p035_3.png] view at source ↗

**Figure 4.** Figure 4: MSE convergence for (a) Uniform and (b) Gaussian input distributions, supplementing [PITH_FULL_IMAGE:figures/full_fig_p036_4.png] view at source ↗

**Figure 5.** Figure 5: SimE heatmaps compare cosine similarity between transformer per-layer error vectors [PITH_FULL_IMAGE:figures/full_fig_p037_5.png] view at source ↗

**Figure 6.** Figure 6: MSE convergence plots for the spherical distribution for [PITH_FULL_IMAGE:figures/full_fig_p038_6.png] view at source ↗

**Figure 7.** Figure 7: SimE heatmaps for the 24-layer depth ablation using the spherical distribution and [PITH_FULL_IMAGE:figures/full_fig_p039_7.png] view at source ↗

**Figure 8.** Figure 8: MSE convergence and argmax step maps for linear attention under (a) Uniform, (b) [PITH_FULL_IMAGE:figures/full_fig_p040_8.png] view at source ↗

**Figure 9.** Figure 9: Heads and widths ablation studies on 12-layer transformers trained on spherical GP [PITH_FULL_IMAGE:figures/full_fig_p041_9.png] view at source ↗

**Figure 10.** Figure 10: Noise-mismatch ablation: three regime snapshots and the full sweep. All MSEs are reported relative to Bayes-optimal converged KRR (dashed line at 1.0); values above 1.0 are over-regularized relative to Bayes-optimal. Top row (regime snapshots): (a) σtest ≪ σtrain: both transformer (orange) and encoded oracle (blue) sit above Bayes-optimal, as neither adapts λ down for cleaner test data, and the transforme… view at source ↗

read the original abstract

Mechanistic accounts of in-context learning (ICL) have identified iterative algorithms for linear regression and related linear prediction tasks, often using linear or ReLU attention variants. For nonlinear ICL, prior work has related softmax and kernelized attention to functional-gradient-type dynamics, but it remains unclear whether a standard transformer with softmax attention can implement a convergent solver with an end-to-end prediction-error guarantee. In this paper, we study in-context kernel ridge regression (KRR) with Gaussian kernels and show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass by implementing preconditioned Richardson iteration on the associated kernel linear system. Under bounded-data assumptions, we construct a single-head transformer with $O(\log(1/\epsilon))$ blocks and MLP width $O(\sqrt{N/\epsilon})$ that achieves $\epsilon$-accurate prediction for prompts of length $N$. Our construction reveals a functional decomposition within the transformer architecture: softmax attention produces a row-normalized Gaussian-kernel operator needed for cross-token interactions, while ReLU MLP layers act locally to approximate the intra-token scalar arithmetic required by the update. Empirically, we train GPT-2-style transformers on Gaussian-process regression tasks to further test the preconditioned Richardson interpretation. Through linear probing, we compare the transformer's layer-wise predictions with the step-wise outputs of classical KRR solvers and find that its error profiles align most consistently with preconditioned Richardson iteration. Ablation studies further support this interpretation. Together, our theory and experiments identify preconditioned Richardson iteration as a concrete mechanism that softmax-attention transformers can realize for nonlinear in-context Gaussian-kernel regression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete construction for a softmax transformer that runs preconditioned Richardson iteration on Gaussian KRR, but the bounded-data assumptions may not deliver a uniform contraction rate independent of point configuration.

read the letter

The main takeaway is that the authors construct a single-head transformer whose forward pass approximates the solution to in-context Gaussian kernel ridge regression by executing a few steps of preconditioned Richardson iteration. Softmax attention produces the row-normalized kernel operator for cross-token mixing, while the ReLU MLP layers handle the local scalar arithmetic for each update. Under bounded-data assumptions they claim O(log(1/ε)) blocks and MLP width O(√(N/ε)) suffice for ε-accurate predictions on length-N prompts. They also train GPT-2-style models on Gaussian-process tasks and use linear probing to show that layer-wise outputs align more closely with Richardson steps than with other classical solvers, backed by ablations. That identification of a specific convergent algorithm is new relative to earlier ICL work on linear regression or functional gradients. The construction is explicit enough to be checked, and the probing gives a practical way to test the claimed mechanism on trained networks. The soft spot is the convergence guarantee. Bounded norms control individual kernel entries, yet nothing in the abstract rules out data configurations with tight clusters that drive some eigenvalues of the kernel matrix near zero; in those cases the iteration matrix can still have spectral radius approaching 1, so the number of blocks needed may grow with the particular point set rather than staying O(log(1/ε)) uniformly. The paper invokes the bounded-data assumptions to secure convergence, but an explicit spectral-gap argument that δ stays bounded away from zero across all such sets is not visible from the summary. This work is aimed at researchers who want mechanistic accounts of nonlinear in-context learning. It supplies a specific, testable algorithm that standard transformers can realize, which is useful for guiding architecture choices even if the bounds require tightening. The combination of construction and empirical alignment is solid enough to justify sending it to peer review; referees can verify the derivation and check whether the contraction rate holds as stated.

Referee Report

2 major / 2 minor

Summary. The paper claims that a standard single-head softmax-attention transformer can implement preconditioned Richardson iteration to solve in-context Gaussian kernel ridge regression. Under bounded-data assumptions, a construction is given for a transformer with O(log(1/ε)) blocks and MLP width O(√(N/ε)) that achieves ε-accurate predictions on N-point prompts. The construction decomposes the task into softmax attention realizing a row-normalized kernel operator and ReLU MLPs handling local scalar updates; empirical probing of trained GPT-2-style models shows layer-wise outputs aligning with the iteration steps rather than other solvers.

Significance. If the construction and associated error bounds are rigorous, the result supplies a concrete algorithmic mechanism for nonlinear ICL that extends prior linear-regression accounts and identifies a functional role for standard transformer components. The explicit reduction to a classical convergent iteration plus the probing experiments constitute a clear strength; the bounded-data hypothesis is the main point requiring verification for the claimed depth and width scalings to hold uniformly.

major comments (2)

[§4] §4 (Construction and convergence analysis): the O(log(1/ε)) block bound requires the iteration matrix of the preconditioned Richardson scheme to have spectral radius ≤ 1-δ with δ > 0 independent of N. Boundedness of the data points controls individual kernel entries but does not automatically preclude configurations (e.g., arbitrarily close clusters) in which the largest eigenvalue of the normalized operator approaches 1; no explicit uniform spectral-gap argument is supplied.
[Theorem 4.1] Theorem 4.1 (Transformer approximation error): the MLP width O(√(N/ε)) bound for approximating the intra-token arithmetic steps appears to rely on Lipschitz constants that may implicitly depend on the conditioning of the kernel matrix K; it is unclear whether the bounded-data assumption alone yields a uniform bound on this conditioning.

minor comments (2)

[§5] The probing experiments in §5 would be strengthened by reporting variance across random seeds or data realizations to confirm that the alignment with preconditioned Richardson is robust rather than an artifact of a particular training run.
[Eq. (8)] Notation for the preconditioner (around Eq. (8)) should explicitly state whether it is realized purely by the attention weights or requires additional learned parameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify places where the bounded-data assumption must be leveraged more explicitly to confirm uniformity of the spectral gap and Lipschitz constants. We address both points below with additional arguments that can be added to the manuscript without altering the core claims or constructions.

read point-by-point responses

Referee: [§4] §4 (Construction and convergence analysis): the O(log(1/ε)) block bound requires the iteration matrix of the preconditioned Richardson scheme to have spectral radius ≤ 1-δ with δ > 0 independent of N. Boundedness of the data points controls individual kernel entries but does not automatically preclude configurations (e.g., arbitrarily close clusters) in which the largest eigenvalue of the normalized operator approaches 1; no explicit uniform spectral-gap argument is supplied.

Authors: We agree that an explicit uniform spectral-gap argument strengthens the presentation. Under the bounded-data assumption (all prompt points lie in a fixed ball of radius R), the Gaussian kernel satisfies m ≤ k(x_i, x_j) ≤ M for constants m, M > 0 depending only on R and the kernel bandwidth. The row-normalized operator P = D^{-1}K is row-stochastic with entries satisfying α/N ≤ p_{ij} ≤ (1/α)/N where α = m/M > 0 is independent of N. For any subset S, the conductance satisfies Φ(S) ≥ c(α) > 0 independent of N and the particular configuration, because every cross term is at least α/N and there are Θ(N) possible cross pairs. Cheeger’s inequality then yields a spectral gap 1 − λ_2(P) ≥ δ(α) > 0 independent of N. Consequently the iteration matrix of preconditioned Richardson iteration has spectral radius ≤ 1 − δ with the same δ, delivering the O(log(1/ε)) block bound uniformly. We will insert this conductance argument as a new lemma in §4. revision: yes
Referee: [Theorem 4.1] Theorem 4.1 (Transformer approximation error): the MLP width O(√(N/ε)) bound for approximating the intra-token arithmetic steps appears to rely on Lipschitz constants that may implicitly depend on the conditioning of the kernel matrix K; it is unclear whether the bounded-data assumption alone yields a uniform bound on this conditioning.

Authors: The bounded-data assumption likewise yields uniform control on all scalar quantities appearing in the intra-token updates. Because m ≤ k_{ij} ≤ M, every row sum lies in [Nm, NM] and the normalized entries remain in [α/N, (1/α)/N]. The intermediate values arising in the Richardson updates (residuals, scaled additions, and divisions by bounded denominators) are therefore bounded by constants depending only on m, M, the target accuracy ε, and the iteration count, all independent of N and data configuration. The functions being approximated by the ReLU MLPs therefore have Lipschitz constants uniform in N. Standard approximation results for ReLU networks then give the stated width O(√(N/ε)) with constants independent of the particular kernel matrix. We will add a short remark after Theorem 4.1 making this uniformity explicit. revision: yes

Circularity Check

0 steps flagged

No circularity: constructive mapping from iteration to transformer components is self-contained

full rationale

The paper's central result is an explicit construction showing how a single-head softmax transformer with specified depth and width can realize preconditioned Richardson iteration for in-context Gaussian KRR. The derivation specifies the functional roles of attention (producing the row-normalized kernel operator) and MLP layers (approximating intra-token arithmetic) under bounded-data assumptions that control norms and approximation error, without any equation reducing a claimed prediction or convergence rate to a fitted parameter, self-referential definition, or load-bearing self-citation. No ansatz is smuggled via prior work, no uniqueness theorem is invoked to force the choice, and the empirical probing section compares layer outputs to classical solver steps without circular fitting. The derivation chain is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on standard kernel-method convergence assumptions plus the bounded-data condition needed for the width and depth bounds; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption bounded-data assumptions
Invoked to guarantee that the preconditioned Richardson iteration converges and that the transformer approximation error can be bounded by the stated O(log(1/ε)) depth and O(√(N/ε)) width.

pith-pipeline@v0.9.0 · 5842 in / 1287 out tokens · 46347 ms · 2026-05-20T22:21:04.911525+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors

[1]

Transformers learn to implement preconditioned gradient descent for in-context learning

Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. In Advances in Neural Information Processing Systems, volume 36, pages 45614--45650, 2023

work page 2023
[2]

What learning algorithm is in-context learning? investigations with linear models

Ekin Aky \"u rek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. In The Eleventh International Conference on Learning Representations, 2023

work page 2023
[3]

In-context language learning: Architectures and algorithms

Ekin Aky \"u rek, Bailin Wang, Yoon Kim, and Jacob Andreas. In-context language learning: Architectures and algorithms. In International Conference on Machine Learning. PMLR, 2024

work page 2024
[4]

Breaking the curse of dimensionality with convex neural networks

Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18 0 (19): 0 1--53, 2017

work page 2017
[5]

Learning Theory from First Principles

Francis Bach. Learning Theory from First Principles. MIT Press, 2024

work page 2024
[6]

Transformers as statisticians: Provable in-context learning with in-context algorithm selection

Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. In Advances in Neural Information Processing Systems, volume 36, pages 57125--57211, 2023

work page 2023
[7]

Understanding neural networks with reproducing kernel B anach spaces

Francesca Bartolucci, Ernesto De Vito, Lorenzo Rosasco, and Stefano Vigogna. Understanding neural networks with reproducing kernel B anach spaces. Applied and Computational Harmonic Analysis, 62: 0 194--236, 2023

work page 2023
[8]

Understanding in-context learning in transformers and LLM s by learning to learn discrete functions

Satwik Bhattamishra, Arkil Patel, Phil Blunsom, and Varun Kanade. Understanding in-context learning in transformers and LLM s by learning to learn discrete functions. In International Conference on Learning Representations, 2024

work page 2024
[9]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877--1901, 2020

work page 1901
[10]

Optimal rates for the regularized least-squares algorithm

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7 0 (3): 0 331--368, 2007

work page 2007
[11]

Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality

Siyu Chen, Heejune Sheen, Tianhao Wang, and Zhuoran Yang. Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality. In The Thirty Seventh Annual Conference on Learning Theory, volume 247, pages 4573--4573. PMLR, 2024

work page 2024
[12]

Transformers implement functional gradient descent to learn non-linear functions in context

Xiang Cheng, Yuxin Chen, and Suvrit Sra. Transformers implement functional gradient descent to learn non-linear functions in context. In International Conference on Machine Learning, volume 235, pages 8002--8037. PMLR, 2024

work page 2024
[13]

Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems 22, 2009

work page 2009
[14]

In-context learning with transformers: Softmax attention adapts to function lipschitzness

Liam Collins, Advait Parulekar, Aryan Mokhtari, Sujay Sanghavi, and Sanjay Shakkottai. In-context learning with transformers: Softmax attention adapts to function lipschitzness. In Advances in Neural Information Processing Systems, volume 37, pages 92638--92696, 2024

work page 2024
[15]

A practical guide to splines, volume 27

Carl De Boor. A practical guide to splines, volume 27. springer New York, 1978

work page 1978
[16]

Asymptotic study of in-context learning with random transformers through equivalent models, 2025

Samet Demir and Zafer Dogan. Asymptotic study of in-context learning with random transformers through equivalent models, 2025

work page 2025
[17]

Nonlinear approximation

Ronald DeVore. Nonlinear approximation. Acta numerica, 7: 0 51--150, 1998

work page 1998
[18]

Neural network approximation

Ronald DeVore, Boris Hanin, and Guergana Petrova. Neural network approximation. Acta Numerica, 30: 0 327--444, 2021

work page 2021
[19]

Saxe, and Aaditya K

Sara Dragutinovi \'c , Andrew M. Saxe, and Aaditya K. Singh. Softmax linear: Transformers may learn to classify in-context by kernel gradient descent. arXiv preprint arXiv:2510.10425, 2025

work page arXiv 2025
[20]

Transformers learn to achieve second-order convergence rates for in-context linear regression

Deqing Fu, Tian-Qi Chen, Robin Jia, and Vatsal Sharan. Transformers learn to achieve second-order convergence rates for in-context linear regression. In Advances in Neural Information Processing Systems, volume 37, pages 98675--98716, 2024

work page 2024
[21]

Liang, and Gregory Valiant

Shivam Garg, Dimitris Tsipras, Percy S. Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. In Advances in Neural Information Processing Systems, volume 35, pages 30583--30598, 2022

work page 2022
[22]

Understanding emergent in-context learning from a kernel regression perspective

Chi Han, Ziqi Wang, Han Zhao, and Heng Ji. Understanding emergent in-context learning from a kernel regression perspective. Transactions on Machine Learning Research, 2025

work page 2025
[23]

In-context convergence of transformers

Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 19660--19722. PMLR, 2024

work page 2024
[24]

Klusowski, Jianqing Fan, and Mengdi Wang

Zihao Li, Yuan Cao, Cheng Gao, Yihang He, Han Liu, Jason M. Klusowski, Jianqing Fan, and Mengdi Wang. One-layer transformer provably learns one-nearest neighbor in context. In Advances in Neural Information Processing Systems, volume 37, pages 82166--82204, 2024

work page 2024
[25]

Lu, Mary I

Yue M. Lu, Mary I. Letey, Jacob A. Zavatone-Veth, Anindita Maiti, and Cengiz Pehlevan. Asymptotic theory of in-context learning by linear attention. Proceedings of the National Academy of Sciences, 122 0 (28): 0 e2502599122, 2025

work page 2025
[26]

Mahankali, Tatsunori B

Arvind V. Mahankali, Tatsunori B. Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. In International Conference on Learning Representations, 2024

work page 2024
[27]

Banach space representer theorems for neural networks and ridge splines

Rahul Parhi and Robert D Nowak. Banach space representer theorems for neural networks and ridge splines. Journal of Machine Learning Research, 22 0 (43): 0 1--40, 2021

work page 2021
[28]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019

work page 2019
[29]

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006

work page 2006
[30]

A Theoretical Introduction to Numerical Analysis

Victor Ryaben’kii and Semyon Tsynkov. A Theoretical Introduction to Numerical Analysis. Chapman & Hall, 2006

work page 2006
[31]

Towards understanding the universality of transformers for next-token prediction

Michael E Sander and Gabriel Peyr \'e . Towards understanding the universality of transformers for next-token prediction. arXiv preprint arXiv:2410.03011, 2024

work page arXiv 2024
[32]

A generalized representer theorem

Bernhard Sch \"o lkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International conference on computational learning theory, pages 416--426. Springer, 2001

work page 2001
[33]

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

Zhaiming Shen, Alexander Hsu, Rongjie Lai, and Wenjing Liao. Understanding in-context learning on structured manifolds: Bridging attention to kernel methods. arXiv preprint arXiv:2506.10959, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

An introduction to the conjugate gradient method without the agonizing pain

Jonathan R Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. 1994

work page 1994
[35]

On the role of transformer feed-forward layers in nonlinear in-context learning, 2025

Haoyuan Sun, Ali Jadbabaie, and Navid Azizan. On the role of transformer feed-forward layers in nonlinear in-context learning, 2025

work page 2025
[36]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth \'e e Lacroix, Baptiste Rozi \`e re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017
[38]

Linear transformers are versatile in-context learners

Max Vladymyrov, Johannes von Oswald, Mark Sandler, and Rong Ge. Linear transformers are versatile in-context learners. In Advances in Neural Information Processing Systems, volume 37, pages 48784--48809, 2024

work page 2024
[39]

Transformers learn in-context by gradient descent

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, Jo \ a o Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151--35174. PMLR, 2023

work page 2023
[40]

Sparse representer theorems for learning in reproducing kernel B anach spaces

Rui Wang, Yuesheng Xu, and Mingsong Yan. Sparse representer theorems for learning in reproducing kernel B anach spaces. Journal of Machine Learning Research, 25 0 (93): 0 1--45, 2024

work page 2024
[41]

Hypothesis spaces for deep learning

Rui Wang, Yuesheng Xu, and Mingsong Yan. Hypothesis spaces for deep learning. Neural Networks, page 107995, 2025

work page 2025
[42]

Bartlett

Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25 0 (49): 0 1--55, 2024 a

work page 2024
[43]

Bartlett

Ruiqi Zhang, Jingfeng Wu, and Peter L. Bartlett. In-context learning of a linear transformer block: Benefits of the MLP component and one-step GD initialization. In Advances in Neural Information Processing Systems, volume 37, pages 18310--18361, 2024 b

work page 2024
[44]

Training dynamics of in-context learning in linear attention

Yedi Zhang, Freya Behrens, Florent Krzakala, and Lenka Zdeborov \'a . Training dynamics of in-context learning in linear attention. arXiv preprint arXiv:2501.16265, 2025

work page arXiv 2025

[1] [1]

Transformers learn to implement preconditioned gradient descent for in-context learning

Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. In Advances in Neural Information Processing Systems, volume 36, pages 45614--45650, 2023

work page 2023

[2] [2]

What learning algorithm is in-context learning? investigations with linear models

Ekin Aky \"u rek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. In The Eleventh International Conference on Learning Representations, 2023

work page 2023

[3] [3]

In-context language learning: Architectures and algorithms

Ekin Aky \"u rek, Bailin Wang, Yoon Kim, and Jacob Andreas. In-context language learning: Architectures and algorithms. In International Conference on Machine Learning. PMLR, 2024

work page 2024

[4] [4]

Breaking the curse of dimensionality with convex neural networks

Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18 0 (19): 0 1--53, 2017

work page 2017

[5] [5]

Learning Theory from First Principles

Francis Bach. Learning Theory from First Principles. MIT Press, 2024

work page 2024

[6] [6]

Transformers as statisticians: Provable in-context learning with in-context algorithm selection

Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. In Advances in Neural Information Processing Systems, volume 36, pages 57125--57211, 2023

work page 2023

[7] [7]

Understanding neural networks with reproducing kernel B anach spaces

Francesca Bartolucci, Ernesto De Vito, Lorenzo Rosasco, and Stefano Vigogna. Understanding neural networks with reproducing kernel B anach spaces. Applied and Computational Harmonic Analysis, 62: 0 194--236, 2023

work page 2023

[8] [8]

Understanding in-context learning in transformers and LLM s by learning to learn discrete functions

Satwik Bhattamishra, Arkil Patel, Phil Blunsom, and Varun Kanade. Understanding in-context learning in transformers and LLM s by learning to learn discrete functions. In International Conference on Learning Representations, 2024

work page 2024

[9] [9]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877--1901, 2020

work page 1901

[10] [10]

Optimal rates for the regularized least-squares algorithm

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7 0 (3): 0 331--368, 2007

work page 2007

[11] [11]

Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality

Siyu Chen, Heejune Sheen, Tianhao Wang, and Zhuoran Yang. Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality. In The Thirty Seventh Annual Conference on Learning Theory, volume 247, pages 4573--4573. PMLR, 2024

work page 2024

[12] [12]

Transformers implement functional gradient descent to learn non-linear functions in context

Xiang Cheng, Yuxin Chen, and Suvrit Sra. Transformers implement functional gradient descent to learn non-linear functions in context. In International Conference on Machine Learning, volume 235, pages 8002--8037. PMLR, 2024

work page 2024

[13] [13]

Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems 22, 2009

work page 2009

[14] [14]

In-context learning with transformers: Softmax attention adapts to function lipschitzness

Liam Collins, Advait Parulekar, Aryan Mokhtari, Sujay Sanghavi, and Sanjay Shakkottai. In-context learning with transformers: Softmax attention adapts to function lipschitzness. In Advances in Neural Information Processing Systems, volume 37, pages 92638--92696, 2024

work page 2024

[15] [15]

A practical guide to splines, volume 27

Carl De Boor. A practical guide to splines, volume 27. springer New York, 1978

work page 1978

[16] [16]

Asymptotic study of in-context learning with random transformers through equivalent models, 2025

Samet Demir and Zafer Dogan. Asymptotic study of in-context learning with random transformers through equivalent models, 2025

work page 2025

[17] [17]

Nonlinear approximation

Ronald DeVore. Nonlinear approximation. Acta numerica, 7: 0 51--150, 1998

work page 1998

[18] [18]

Neural network approximation

Ronald DeVore, Boris Hanin, and Guergana Petrova. Neural network approximation. Acta Numerica, 30: 0 327--444, 2021

work page 2021

[19] [19]

Saxe, and Aaditya K

Sara Dragutinovi \'c , Andrew M. Saxe, and Aaditya K. Singh. Softmax linear: Transformers may learn to classify in-context by kernel gradient descent. arXiv preprint arXiv:2510.10425, 2025

work page arXiv 2025

[20] [20]

Transformers learn to achieve second-order convergence rates for in-context linear regression

Deqing Fu, Tian-Qi Chen, Robin Jia, and Vatsal Sharan. Transformers learn to achieve second-order convergence rates for in-context linear regression. In Advances in Neural Information Processing Systems, volume 37, pages 98675--98716, 2024

work page 2024

[21] [21]

Liang, and Gregory Valiant

Shivam Garg, Dimitris Tsipras, Percy S. Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. In Advances in Neural Information Processing Systems, volume 35, pages 30583--30598, 2022

work page 2022

[22] [22]

Understanding emergent in-context learning from a kernel regression perspective

Chi Han, Ziqi Wang, Han Zhao, and Heng Ji. Understanding emergent in-context learning from a kernel regression perspective. Transactions on Machine Learning Research, 2025

work page 2025

[23] [23]

In-context convergence of transformers

Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 19660--19722. PMLR, 2024

work page 2024

[24] [24]

Klusowski, Jianqing Fan, and Mengdi Wang

Zihao Li, Yuan Cao, Cheng Gao, Yihang He, Han Liu, Jason M. Klusowski, Jianqing Fan, and Mengdi Wang. One-layer transformer provably learns one-nearest neighbor in context. In Advances in Neural Information Processing Systems, volume 37, pages 82166--82204, 2024

work page 2024

[25] [25]

Lu, Mary I

Yue M. Lu, Mary I. Letey, Jacob A. Zavatone-Veth, Anindita Maiti, and Cengiz Pehlevan. Asymptotic theory of in-context learning by linear attention. Proceedings of the National Academy of Sciences, 122 0 (28): 0 e2502599122, 2025

work page 2025

[26] [26]

Mahankali, Tatsunori B

Arvind V. Mahankali, Tatsunori B. Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. In International Conference on Learning Representations, 2024

work page 2024

[27] [27]

Banach space representer theorems for neural networks and ridge splines

Rahul Parhi and Robert D Nowak. Banach space representer theorems for neural networks and ridge splines. Journal of Machine Learning Research, 22 0 (43): 0 1--40, 2021

work page 2021

[28] [28]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019

work page 2019

[29] [29]

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006

work page 2006

[30] [30]

A Theoretical Introduction to Numerical Analysis

Victor Ryaben’kii and Semyon Tsynkov. A Theoretical Introduction to Numerical Analysis. Chapman & Hall, 2006

work page 2006

[31] [31]

Towards understanding the universality of transformers for next-token prediction

Michael E Sander and Gabriel Peyr \'e . Towards understanding the universality of transformers for next-token prediction. arXiv preprint arXiv:2410.03011, 2024

work page arXiv 2024

[32] [32]

A generalized representer theorem

Bernhard Sch \"o lkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International conference on computational learning theory, pages 416--426. Springer, 2001

work page 2001

[33] [33]

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

Zhaiming Shen, Alexander Hsu, Rongjie Lai, and Wenjing Liao. Understanding in-context learning on structured manifolds: Bridging attention to kernel methods. arXiv preprint arXiv:2506.10959, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

An introduction to the conjugate gradient method without the agonizing pain

Jonathan R Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. 1994

work page 1994

[35] [35]

On the role of transformer feed-forward layers in nonlinear in-context learning, 2025

Haoyuan Sun, Ali Jadbabaie, and Navid Azizan. On the role of transformer feed-forward layers in nonlinear in-context learning, 2025

work page 2025

[36] [36]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth \'e e Lacroix, Baptiste Rozi \`e re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017

[38] [38]

Linear transformers are versatile in-context learners

Max Vladymyrov, Johannes von Oswald, Mark Sandler, and Rong Ge. Linear transformers are versatile in-context learners. In Advances in Neural Information Processing Systems, volume 37, pages 48784--48809, 2024

work page 2024

[39] [39]

Transformers learn in-context by gradient descent

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, Jo \ a o Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151--35174. PMLR, 2023

work page 2023

[40] [40]

Sparse representer theorems for learning in reproducing kernel B anach spaces

Rui Wang, Yuesheng Xu, and Mingsong Yan. Sparse representer theorems for learning in reproducing kernel B anach spaces. Journal of Machine Learning Research, 25 0 (93): 0 1--45, 2024

work page 2024

[41] [41]

Hypothesis spaces for deep learning

Rui Wang, Yuesheng Xu, and Mingsong Yan. Hypothesis spaces for deep learning. Neural Networks, page 107995, 2025

work page 2025

[42] [42]

Bartlett

Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25 0 (49): 0 1--55, 2024 a

work page 2024

[43] [43]

Bartlett

Ruiqi Zhang, Jingfeng Wu, and Peter L. Bartlett. In-context learning of a linear transformer block: Benefits of the MLP component and one-step GD initialization. In Advances in Neural Information Processing Systems, volume 37, pages 18310--18361, 2024 b

work page 2024

[44] [44]

Training dynamics of in-context learning in linear attention

Yedi Zhang, Freya Behrens, Florent Krzakala, and Lenka Zdeborov \'a . Training dynamics of in-context learning in linear attention. arXiv preprint arXiv:2501.16265, 2025

work page arXiv 2025