Ghost in the Kernel: In-Context Learning with Efficient Transformers via Domain Generalization

Ding-Xuan Zhou; Peilin Liu

arxiv: 2607.00479 · v1 · pith:B2LQU7P7new · submitted 2026-07-01 · 💻 cs.LG · stat.ML

Ghost in the Kernel: In-Context Learning with Efficient Transformers via Domain Generalization

Peilin Liu , Ding-Xuan Zhou This is my paper

Pith reviewed 2026-07-02 16:18 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords linear transformersin-context learningdomain generalizationgeneralization boundsefficient attentiontransformer linearizationactivation design

0 comments

The pith

Linear transformers perform in-context learning by mapping context distributions to response functions under domain generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes linear transformers, which reduce attention complexity from quadratic to linear in context length, through the lens of domain generalization with a two-staged sampling process. It frames in-context learning as the model learning a mapping from distributions over contexts to response functions, and derives generalization bounds that hold independently of dimension while exposing a tradeoff in regularity between data distributions and latent features. This view also informs new choices for activations and losses when converting pretrained softmax-based large language models into linear form. A reader would care because the work supplies a concrete theoretical account of how efficient transformers can still adapt to new tasks from context alone, without parameter updates or quadratic costs.

Core claim

Linear transformers perform in-context learning as learning a mapping from context distributions to response functions. A dimension-independent convergence rate is obtained for our generalization analysis, which also exhibits the tradeoff between the regularities of data distributions and latent features. Guided by our theoretical framework, we propose a new perspective on activation and loss design for linearizing pretrained softmax large language models.

What carries the argument

The two-staged sampling process from domain generalization, used to analyze the feature mapping inside linear attention.

If this is right

Linear transformers achieve in-context learning with generalization rates independent of dimension.
Convergence rates reflect a tradeoff between regularity of the data distributions and regularity of the latent features.
Activation and loss functions can be redesigned to convert pretrained softmax transformers into linear versions while preserving in-context capability.
The domain-generalization framing supplies a route to theoretical guarantees for other efficient attention variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage sampling lens could be applied to analyze kernel approximations in other linear or sparse attention mechanisms.
If the mapping interpretation is accurate, one could test whether real-world context distributions in language tasks exhibit the regularity levels needed for the predicted rates.
The tradeoff between distribution and feature regularity suggests a practical knob for choosing regularization strength when training linear transformers on heterogeneous data.

Load-bearing premise

The two-staged sampling process from domain generalization accurately captures the mechanism of in-context learning in linear transformers.

What would settle it

An empirical measurement showing that the generalization error rate of a linear transformer on in-context tasks depends on input dimension, or that the learned mapping fails to align with the response functions predicted by the two-stage model.

Figures

Figures reproduced from arXiv: 2607.00479 by Ding-Xuan Zhou, Peilin Liu.

read the original abstract

Transformer-based large models have demonstrated remarkable generalization abilities across different tasks by leveraging a context-aware attention module for in-context learning. With richer context, transformers adapt more effectively to the current use case without any parameter updates. However, the quadratic computational and memory complexity with respect to context length significantly slows data processing in softmax transformers. Linear transformers were proposed to address this issue by reducing the complexity to linear dependence on context length, but the design and understanding of the feature mapping in linear attention, from a theoretical viewpoint, remain unclear. In this paper, we investigate the approximation and generalization abilities of linear transformers under a two-staged sampling process from domain generalization. We show that linear transformers perform in-context learning as learning a mapping from context distributions to response functions. A dimension-independent convergence rate is obtained for our generalization analysis, which also exhibits the tradeoff between the regularities of data distributions and latent features. Guided by our theoretical framework, we propose a new perspective on activation and loss design for linearizing pretrained softmax large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The domain generalization sampling gives dimension-independent ICL bounds for linear transformers but rests on a modeling choice that may not match standard in-context learning.

read the letter

The main thing to know is that this paper frames linear transformer in-context learning as learning a mapping from context distributions to response functions under a two-staged domain generalization sampling process, and derives a dimension-independent convergence rate plus a regularity tradeoff between data distributions and latent features. They then use the framework to suggest activation and loss designs for linearizing pretrained models.

The new element is the explicit domain generalization lens and the resulting dimension-free rate. Prior linear attention work focused on kernel approximations or feature maps, but this approach treats the context as sampled in two stages (latent features first, then conditioned examples) to obtain generalization bounds that avoid dimension dependence. That is a concrete technical step if the setup holds.

The paper does a reasonable job laying out the approximation and generalization analysis under its chosen sampling. The tradeoff observation is a direct consequence of the bounds and could inform design choices.

The soft spot is the modeling assumption. The two-staged sampling is used to derive the rates and the design recommendations, yet it is not obvious that it reproduces the distribution over context-response pairs that arises in ordinary ICL, where a task is fixed and examples are drawn from the same distribution. If the induced distributions differ, the bounds and the suggested activation/loss changes rest on an analogy rather than on the actual attention computation. The abstract gives no indication that this gap is closed by comparison to standard ICL or by additional experiments.

The work is aimed at theorists who care about efficient transformer analysis and generalization. A reader already working on linear attention or ICL bounds could extract the rate and the tradeoff for further study.

It is worth sending to peer review. The dimension-independent result is worth checking in detail even if the sampling model needs justification or relaxation.

Referee Report

2 major / 1 minor

Summary. The manuscript frames linear transformers' in-context learning as learning a mapping from context distributions to response functions under a two-staged sampling process drawn from domain generalization. It derives a dimension-independent convergence rate for the generalization analysis, identifies a regularity tradeoff between data distributions and latent features, and uses the framework to recommend new activation and loss designs for linearizing pretrained softmax LLMs.

Significance. If the two-staged sampling model is valid, the dimension-independent rate and the resulting activation/loss perspective would offer a useful theoretical bridge between domain generalization and efficient transformer design, with potential impact on practical linear attention implementations.

major comments (2)

[Section 3 (two-staged sampling definition) and the statements of the main generalization theorem] The central modeling assumption—that the two-staged sampling process (latent features drawn first, then context conditioned on them) reproduces the distribution of context-response pairs arising in standard ICL—is invoked to obtain both the approximation and generalization bounds as well as the activation/loss recommendations. No justification, comparison to fixed-task ICL sampling, or empirical check is supplied, rendering the derived rates and design guidance dependent on an unverified modeling choice.
[Main generalization theorem (the convergence-rate statement)] The dimension-independent convergence rate is stated to hold under the regularity tradeoff; however, the precise dependence on the regularity parameters of the data distribution versus the latent features is not made explicit in the bound statement, so it is unclear whether the rate remains dimension-free once those parameters are allowed to vary with dimension.

minor comments (1)

[Preliminaries] Notation for the response function and the induced mapping from context distributions is introduced without an explicit comparison table to the standard attention formulation; a short side-by-side would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and outline the revisions we will make.

read point-by-point responses

Referee: [Section 3 (two-staged sampling definition) and the statements of the main generalization theorem] The central modeling assumption—that the two-staged sampling process (latent features drawn first, then context conditioned on them) reproduces the distribution of context-response pairs arising in standard ICL—is invoked to obtain both the approximation and generalization bounds as well as the activation/loss recommendations. No justification, comparison to fixed-task ICL sampling, or empirical check is supplied, rendering the derived rates and design guidance dependent on an unverified modeling choice.

Authors: The two-staged sampling is introduced to connect the ICL setting to the domain generalization literature, where a similar hierarchical process models distribution shifts between contexts. We agree that the manuscript would benefit from explicit motivation. In revision we will expand Section 3 with a dedicated paragraph motivating the choice, citing relevant domain-generalization works that employ analogous two-stage sampling, and providing a brief comparison to the fixed-task ICL sampling used in prior transformer analyses. As the contribution is primarily theoretical, we will flag empirical validation of the modeling assumption as future work rather than adding new experiments. revision: partial
Referee: [Main generalization theorem (the convergence-rate statement)] The dimension-independent convergence rate is stated to hold under the regularity tradeoff; however, the precise dependence on the regularity parameters of the data distribution versus the latent features is not made explicit in the bound statement, so it is unclear whether the rate remains dimension-free once those parameters are allowed to vary with dimension.

Authors: We agree that the dependence should be stated explicitly. The dimension-free rate holds when the regularity parameters (smoothness, boundedness, etc.) of both the data distributions and the latent features remain independent of dimension and satisfy the stated tradeoff. We will revise the main theorem statement to include this explicit condition, thereby clarifying that the convergence rate is dimension-independent precisely under the assumption that these parameters do not scale with dimension. revision: yes

Circularity Check

0 steps flagged

No circularity: modeling choice followed by independent analysis

full rationale

The paper adopts a two-staged sampling process drawn from domain generalization as the framework for analyzing linear transformer ICL, then derives a mapping from context distributions to response functions plus dimension-independent convergence rates under that model. No equations are supplied in the abstract or visible text that reduce the claimed mapping or rates to fitted parameters or self-definitions by construction. No self-citation chains, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation appear in the provided material. The derivation therefore remains self-contained against external benchmarks once the modeling assumption is granted; the skeptic concern targets the realism of the assumption itself rather than any internal reduction of the claimed results to their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5706 in / 1223 out tokens · 55468 ms · 2026-07-02T16:18:27.227640+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 6 canonical work pages · 5 internal anchors

[1]

Akyürek, D

E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou. What learning algorithm is in- context learning? investigations with linear models. In The Eleventh International Conference on Learning Representations , 2023

2023
[2]

F. Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017

2017
[3]

A. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory , 39(3):930–945, 1993

1993
[4]

Blanchard, G

G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems , volume 24. Curran Associates, Inc., 2011

2011
[5]

Blanchard, A

G. Blanchard, A. A. Deshmukh, U. Dogan, G. Lee, and C. Scott. Domain generalization by marginal transfer learning. Journal of machine learning research , 22(2):1–55, 2021

2021
[6]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020

1901
[7]

K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations , 2021

2021
[8]

Christmann and I

A. Christmann and I. Steinwart. Universal kernels on non-standard input spaces. In Advances in Neural Information Processing Systems , volume 23. Curran Associates, Inc., 2010

2010
[9]

Cucker and D

F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint , volume 24. Cambridge University Press, 2007

2007
[10]

De Ryck, S

T. De Ryck, S. Lanthaler, and S. Mishra. On the approximation of functions by tanh neural networks. Neural Networks , 143:732–750, 2021

2021
[11]

De Ryck, A

T. De Ryck, A. D. Jagtap, and S. Mishra. Error estimates for physics-informed neural networks approximating the navier–stokes equations. IMA Journal of Numerical Analysis , page drac085, 2023

2023
[12]

Devlin, M.-W

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages 4171–4186, 2019

2019
[13]

G. E. Fasshauer, F. J. Hickernell, and H. Woźniakowski. On dimension-independent rates of convergence for function approximation with gaussian kernels. SIAM Journal on Numerical Analysis, 50(1):247–271, 2012

2012
[14]

Furuya, M

T. Furuya, M. V. de Hoop, and G. Peyré. Transformers are universal in-context learners. In The Thirteenth International Conference on Learning Representations , 2025

2025
[15]

S. Garg, D. Tsipras, P. S. Liang, and G. Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in neural information processing systems , 35: 30583–30598, 2022. 50

2022
[16]

Gu and T

A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling , 2024

2024
[17]

K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

2022
[18]

Gaussian Error Linear Units (GELUs)

D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[19]

C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm, 2019

2019
[20]

Katharopoulos, A

A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are rnns: Fast autore- gressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning , pages 5156–5165. PMLR, 2020

2020
[21]

Y. Korolev. Two-layer neural networks with values in a banach space. SIAM Journal on Mathematical Analysis, 54(6):6358–6389, 2022

2022
[22]

Ledoux and M

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer Berlin Heidelberg, Berlin, Heidelberg, 1991. ISBN 978-3-642-20211-7 978-3-642-20212-4

1991
[23]

Liu and D.-X

P. Liu and D.-X. Zhou. Generalization analysis of transformers in distribution regression. Neural Computation, 37(2):260–293, 2025

2025
[24]

C. Ma, R. Pathak, and M. J. Wainwright. Optimally tackling covariate shift in rkhs-based nonparametric regression. The Annals of Statistics , 51(2):738–761, 2023

2023
[25]

A. Maurer. A vector-contraction inequality for rademacher complexities. In International Conference on Algorithmic Learning Theory , pages 3–17. Springer, 2016

2016
[26]

Maurer and M

A. Maurer and M. Pontil. Concentration inequalities under sub-gaussian and sub-exponential conditions. In Advances in Neural Information Processing Systems, volume 34, pages 7588–7597. Curran Associates, Inc., 2021

2021
[27]

S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer. Re- thinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages 11048–11064, 2022

2022
[28]

N. Mücke. Stochastic gradient descent meets distribution regression. In International Confer- ence on Artificial Intelligence and Statistics , pages 2143–2151. PMLR, 2021

2021
[29]

Nguyen and N

M. Nguyen and N. Mücke. Optimal convergence rates for neural operators. arXiv preprint arXiv:2412.17518, 2024

work page arXiv 2024
[30]

Novak and H

E. Novak and H. Woźniakowski. Tractability of Multivariate Problems. 1: Linear Information . Number 6. European Mathematical Soc, Zürich, 2008. ISBN 978-3-03719-026-5

2008
[31]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 4195–4205, 2023

2023
[32]

Z. Qin, X. Han, W. Sun, D. Li, L. Kong, N. Barnes, and Y. Zhong. The devil in linear transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7025–7041, 2022. 51

2022
[33]

Searching for Activation Functions

P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning . MIT Press, Cambridge, Mass, 2006. ISBN 978-0-262-18253-9

2006
[35]

A. Rényi. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: contributions to the theory of statistics, volume 4, pages 547–562. University of California Press, 1961

1961
[36]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

2024
[37]

N. Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202 , 2020

work page internal anchor Pith review Pith/arXiv arXiv 2002
[38]

Z. Shen, A. Hsu, R. Lai, and W. Liao. Understanding in-context learning on structured mani- folds: Bridging attention to kernel methods. arXiv preprint arXiv:2506.10959 , 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

J. W. Siegel. Optimal approximation of zonoids and uniform approximation by shallow neural networks. Constructive Approximation, pages 1–29, 2025

2025
[40]

J. W. Siegel and J. Xu. Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks. Foundations of Computational Mathematics , 24(2):481–537, 2024

2024
[41]

Song and S

Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems , volume 32. Curran Associates, Inc., 2019

2019
[42]

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 11:1517–1561, 2010

2010
[43]

Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Y.-H. H. Tsai, S. Bai, M. Yamada, L.-P. Morency, and R. Salakhutdinov. Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , page...

2019
[45]

van Erven and P

T. van Erven and P. Harremoës. Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory , 60(7):3797–3820, 2014

2014
[46]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017

2017
[47]

M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint . Cambridge University Press, 1 edition, Feb. 2019. ISBN 978-1-108-62777-1 978-1-108-49802-9

2019
[48]

S. M. Xie, A. Raghunathan, P. Liang, and T. Ma. An explanation of in-context learning as implicit bayesian inference. In International Conference on Learning Representations , 2022. 52

2022
[49]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

2025
[50]

S. Yang, B. Wang, Y. Shen, R. Panda, and Y. Kim. Gated linear attention transformers with hardware-eﬀicient training. In Forty-first International Conference on Machine Learning , 2024

2024
[51]

Yang and D.-X

Y. Yang and D.-X. Zhou. Optimal rates of approximation by shallow relu$$^k$$neural networks and applications to nonparametric regression. Constructive Approximation, 2024

2024
[52]

Zaheer, G

M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems , volume 33, pages 17283–17297. Curran Associates, Inc., 2020

2020
[53]

Zhang and R

B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in neural infor- mation processing systems , 32, 2019

2019
[54]

Zhang, K

M. Zhang, K. Bhatia, H. Kumbong, and C. Re. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. In The Twelfth International Conference on Learning Representations, 2024

2024
[55]

Zhang, S

R. Zhang, S. Frei, and P. L. Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research , 25(49):1–55, 2024

2024
[56]

D.-X. Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics , 220(1):456–463, 2008. 53

2008

[1] [1]

Akyürek, D

E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou. What learning algorithm is in- context learning? investigations with linear models. In The Eleventh International Conference on Learning Representations , 2023

2023

[2] [2]

F. Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017

2017

[3] [3]

A. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory , 39(3):930–945, 1993

1993

[4] [4]

Blanchard, G

G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems , volume 24. Curran Associates, Inc., 2011

2011

[5] [5]

Blanchard, A

G. Blanchard, A. A. Deshmukh, U. Dogan, G. Lee, and C. Scott. Domain generalization by marginal transfer learning. Journal of machine learning research , 22(2):1–55, 2021

2021

[6] [6]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020

1901

[7] [7]

K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations , 2021

2021

[8] [8]

Christmann and I

A. Christmann and I. Steinwart. Universal kernels on non-standard input spaces. In Advances in Neural Information Processing Systems , volume 23. Curran Associates, Inc., 2010

2010

[9] [9]

Cucker and D

F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint , volume 24. Cambridge University Press, 2007

2007

[10] [10]

De Ryck, S

T. De Ryck, S. Lanthaler, and S. Mishra. On the approximation of functions by tanh neural networks. Neural Networks , 143:732–750, 2021

2021

[11] [11]

De Ryck, A

T. De Ryck, A. D. Jagtap, and S. Mishra. Error estimates for physics-informed neural networks approximating the navier–stokes equations. IMA Journal of Numerical Analysis , page drac085, 2023

2023

[12] [12]

Devlin, M.-W

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages 4171–4186, 2019

2019

[13] [13]

G. E. Fasshauer, F. J. Hickernell, and H. Woźniakowski. On dimension-independent rates of convergence for function approximation with gaussian kernels. SIAM Journal on Numerical Analysis, 50(1):247–271, 2012

2012

[14] [14]

Furuya, M

T. Furuya, M. V. de Hoop, and G. Peyré. Transformers are universal in-context learners. In The Thirteenth International Conference on Learning Representations , 2025

2025

[15] [15]

S. Garg, D. Tsipras, P. S. Liang, and G. Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in neural information processing systems , 35: 30583–30598, 2022. 50

2022

[16] [16]

Gu and T

A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling , 2024

2024

[17] [17]

K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

2022

[18] [18]

Gaussian Error Linear Units (GELUs)

D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[19] [19]

C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm, 2019

2019

[20] [20]

Katharopoulos, A

A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are rnns: Fast autore- gressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning , pages 5156–5165. PMLR, 2020

2020

[21] [21]

Y. Korolev. Two-layer neural networks with values in a banach space. SIAM Journal on Mathematical Analysis, 54(6):6358–6389, 2022

2022

[22] [22]

Ledoux and M

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer Berlin Heidelberg, Berlin, Heidelberg, 1991. ISBN 978-3-642-20211-7 978-3-642-20212-4

1991

[23] [23]

Liu and D.-X

P. Liu and D.-X. Zhou. Generalization analysis of transformers in distribution regression. Neural Computation, 37(2):260–293, 2025

2025

[24] [24]

C. Ma, R. Pathak, and M. J. Wainwright. Optimally tackling covariate shift in rkhs-based nonparametric regression. The Annals of Statistics , 51(2):738–761, 2023

2023

[25] [25]

A. Maurer. A vector-contraction inequality for rademacher complexities. In International Conference on Algorithmic Learning Theory , pages 3–17. Springer, 2016

2016

[26] [26]

Maurer and M

A. Maurer and M. Pontil. Concentration inequalities under sub-gaussian and sub-exponential conditions. In Advances in Neural Information Processing Systems, volume 34, pages 7588–7597. Curran Associates, Inc., 2021

2021

[27] [27]

S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer. Re- thinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages 11048–11064, 2022

2022

[28] [28]

N. Mücke. Stochastic gradient descent meets distribution regression. In International Confer- ence on Artificial Intelligence and Statistics , pages 2143–2151. PMLR, 2021

2021

[29] [29]

Nguyen and N

M. Nguyen and N. Mücke. Optimal convergence rates for neural operators. arXiv preprint arXiv:2412.17518, 2024

work page arXiv 2024

[30] [30]

Novak and H

E. Novak and H. Woźniakowski. Tractability of Multivariate Problems. 1: Linear Information . Number 6. European Mathematical Soc, Zürich, 2008. ISBN 978-3-03719-026-5

2008

[31] [31]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 4195–4205, 2023

2023

[32] [32]

Z. Qin, X. Han, W. Sun, D. Li, L. Kong, N. Barnes, and Y. Zhong. The devil in linear transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7025–7041, 2022. 51

2022

[33] [33]

Searching for Activation Functions

P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[34] [34]

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning . MIT Press, Cambridge, Mass, 2006. ISBN 978-0-262-18253-9

2006

[35] [35]

A. Rényi. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: contributions to the theory of statistics, volume 4, pages 547–562. University of California Press, 1961

1961

[36] [36]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

2024

[37] [37]

N. Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202 , 2020

work page internal anchor Pith review Pith/arXiv arXiv 2002

[38] [38]

Z. Shen, A. Hsu, R. Lai, and W. Liao. Understanding in-context learning on structured mani- folds: Bridging attention to kernel methods. arXiv preprint arXiv:2506.10959 , 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

J. W. Siegel. Optimal approximation of zonoids and uniform approximation by shallow neural networks. Constructive Approximation, pages 1–29, 2025

2025

[40] [40]

J. W. Siegel and J. Xu. Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks. Foundations of Computational Mathematics , 24(2):481–537, 2024

2024

[41] [41]

Song and S

Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems , volume 32. Curran Associates, Inc., 2019

2019

[42] [42]

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 11:1517–1561, 2010

2010

[43] [43]

Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

Y.-H. H. Tsai, S. Bai, M. Yamada, L.-P. Morency, and R. Salakhutdinov. Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , page...

2019

[45] [45]

van Erven and P

T. van Erven and P. Harremoës. Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory , 60(7):3797–3820, 2014

2014

[46] [46]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017

2017

[47] [47]

M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint . Cambridge University Press, 1 edition, Feb. 2019. ISBN 978-1-108-62777-1 978-1-108-49802-9

2019

[48] [48]

S. M. Xie, A. Raghunathan, P. Liang, and T. Ma. An explanation of in-context learning as implicit bayesian inference. In International Conference on Learning Representations , 2022. 52

2022

[49] [49]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

2025

[50] [50]

S. Yang, B. Wang, Y. Shen, R. Panda, and Y. Kim. Gated linear attention transformers with hardware-eﬀicient training. In Forty-first International Conference on Machine Learning , 2024

2024

[51] [51]

Yang and D.-X

Y. Yang and D.-X. Zhou. Optimal rates of approximation by shallow relu$$^k$$neural networks and applications to nonparametric regression. Constructive Approximation, 2024

2024

[52] [52]

Zaheer, G

M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems , volume 33, pages 17283–17297. Curran Associates, Inc., 2020

2020

[53] [53]

Zhang and R

B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in neural infor- mation processing systems , 32, 2019

2019

[54] [54]

Zhang, K

M. Zhang, K. Bhatia, H. Kumbong, and C. Re. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. In The Twelfth International Conference on Learning Representations, 2024

2024

[55] [55]

Zhang, S

R. Zhang, S. Frei, and P. L. Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research , 25(49):1–55, 2024

2024

[56] [56]

D.-X. Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics , 220(1):456–463, 2008. 53

2008