Ghost in the Kernel: In-Context Learning with Efficient Transformers via Domain Generalization
Pith reviewed 2026-07-02 16:18 UTC · model grok-4.3
The pith
Linear transformers perform in-context learning by mapping context distributions to response functions under domain generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Linear transformers perform in-context learning as learning a mapping from context distributions to response functions. A dimension-independent convergence rate is obtained for our generalization analysis, which also exhibits the tradeoff between the regularities of data distributions and latent features. Guided by our theoretical framework, we propose a new perspective on activation and loss design for linearizing pretrained softmax large language models.
What carries the argument
The two-staged sampling process from domain generalization, used to analyze the feature mapping inside linear attention.
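To make the two-staged sampling idea concrete, here is a minimal sketch. It first draws a task from a meta-distribution, then draws a context of input-response pairs from that task. The Gaussian context distribution, the linear response function, and the names `sample_task`, `sample_context`, and `sample_prompts` are illustrative assumptions, not the paper's construction.

```python
import random


def sample_task(dim, rng):
    """Stage 1: draw a task. Here a task is a context mean plus a linear
    response weight vector (both hypothetical modeling choices)."""
    mean = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    weights = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return mean, weights


def sample_context(task, n_points, rng, noise_std=0.1):
    """Stage 2: draw (x, y) pairs from the distribution fixed by the task."""
    mean, weights = task
    context = []
    for _ in range(n_points):
        x = [m + rng.gauss(0.0, 1.0) for m in mean]
        y = sum(w * xi for w, xi in zip(weights, x)) + rng.gauss(0.0, noise_std)
        context.append((x, y))
    return context


def sample_prompts(n_tasks, n_points, dim, seed=0):
    """Run both stages: one freshly drawn task per prompt."""
    rng = random.Random(seed)
    return [sample_context(sample_task(dim, rng), n_points, rng)
            for _ in range(n_tasks)]


prompts = sample_prompts(n_tasks=4, n_points=8, dim=3)
```

Each prompt comes from its own task, so the randomness splits into a between-task part (stage 1) and a within-context part (stage 2), which is the hierarchical structure the analysis relies on.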
If this is right
- Linear transformers achieve in-context learning with generalization rates independent of dimension.
- Convergence rates reflect a tradeoff between regularity of the data distributions and regularity of the latent features.
- Activation and loss functions can be redesigned to convert pretrained softmax transformers into linear versions while preserving in-context capability.
- The domain-generalization framing supplies a route to theoretical guarantees for other efficient attention variants.
Where Pith is reading between the lines
- The same two-stage sampling lens could be applied to analyze kernel approximations in other linear or sparse attention mechanisms.
- If the mapping interpretation is accurate, one could test whether real-world context distributions in language tasks exhibit the regularity levels needed for the predicted rates.
- The tradeoff between distribution and feature regularity suggests a practical knob for choosing regularization strength when training linear transformers on heterogeneous data.
Load-bearing premise
The two-staged sampling process from domain generalization accurately captures the mechanism of in-context learning in linear transformers.
What would settle it
An empirical measurement showing that the generalization error rate of a linear transformer on in-context tasks depends on input dimension, or that the learned mapping fails to align with the response functions predicted by the two-stage model.
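One hedged way to set up such a measurement is to fit the empirical convergence exponent separately for each input dimension and check whether it drifts. The helper below estimates the log-log slope of error against sample size by least squares. The name `loglog_slope` is hypothetical, and the numbers are a synthetic power law used only to exercise the function, not measured results.

```python
import math


def loglog_slope(sample_sizes, errors):
    """Least-squares slope of log(error) against log(sample size)."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic placeholder data: an exact n^(-1/2) decay, NOT an experiment.
synthetic_sizes = [100, 1000, 10000]
synthetic_errors = [n ** -0.5 for n in synthetic_sizes]
slope = loglog_slope(synthetic_sizes, synthetic_errors)
```

In a real check, one would compute this slope from measured errors at several input dimensions. A slope that flattens as dimension grows would count against a dimension-independent rate.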
Original abstract
Transformer-based large models have demonstrated remarkable generalization abilities across different tasks by leveraging a context-aware attention module for in-context learning. With richer context, transformers adapt more effectively to the current use case without any parameter updates. However, the quadratic computational and memory complexity with respect to context length significantly slows data processing in softmax transformers. Linear transformers were proposed to address this issue by reducing the complexity to linear dependence on context length, but the design and understanding of the feature mapping in linear attention, from a theoretical viewpoint, remain unclear. In this paper, we investigate the approximation and generalization abilities of linear transformers under a two-staged sampling process from domain generalization. We show that linear transformers perform in-context learning as learning a mapping from context distributions to response functions. A dimension-independent convergence rate is obtained for our generalization analysis, which also exhibits the tradeoff between the regularities of data distributions and latent features. Guided by our theoretical framework, we propose a new perspective on activation and loss design for linearizing pretrained softmax large language models.
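The complexity reduction mentioned in the abstract comes from reordering the attention computation once the softmax is replaced by a feature map. The sketch below is a plain-Python illustration under stated assumptions: the feature map `elu(x) + 1` is a common choice in the linear-attention literature and is not claimed to be the one this paper recommends, the attention is non-causal, and all function names are hypothetical.

```python
import math


def feature_map(v):
    """Assumed positive feature map: elu(x) + 1, applied elementwise."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in v]


def linear_attention(queries, keys, values):
    """Accumulate key-value summaries once, then reuse them for every
    query, so the cost grows linearly with context length."""
    d_feat, d_val = len(keys[0]), len(values[0])
    kv_sum = [[0.0] * d_val for _ in range(d_feat)]  # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d_feat                           # sum_j phi(k_j)
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d_feat):
            k_sum[a] += fk[a]
            for b in range(d_val):
                kv_sum[a][b] += fk[a] * v[b]
    outputs = []
    for q in queries:
        fq = feature_map(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d_feat))
        outputs.append([sum(fq[a] * kv_sum[a][b] for a in range(d_feat)) / norm
                        for b in range(d_val)])
    return outputs


def pairwise_attention(queries, keys, values):
    """Same kernelized attention, computed over all query-key pairs
    (quadratic in context length). Used only as a reference."""
    d_val = len(values[0])
    outputs = []
    for q in queries:
        fq = feature_map(q)
        scores = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in keys]
        total = sum(scores)
        outputs.append([sum(s * v[b] for s, v in zip(scores, values)) / total
                        for b in range(d_val)])
    return outputs
```

Both functions compute the same output. The linear version never forms the full query-key score matrix, which is where the quadratic cost of softmax attention arises.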
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript frames linear transformers' in-context learning as learning a mapping from context distributions to response functions under a two-staged sampling process drawn from domain generalization. It derives a dimension-independent convergence rate for the generalization analysis, identifies a regularity tradeoff between data distributions and latent features, and uses the framework to recommend new activation and loss designs for linearizing pretrained softmax LLMs.
Significance. If the two-staged sampling model is valid, the dimension-independent rate and the resulting activation/loss perspective would offer a useful theoretical bridge between domain generalization and efficient transformer design, with potential impact on practical linear attention implementations.
major comments (2)
- [Section 3 (two-staged sampling definition) and the statements of the main generalization theorem] The central modeling assumption—that the two-staged sampling process (latent features drawn first, then context conditioned on them) reproduces the distribution of context-response pairs arising in standard ICL—is invoked to obtain both the approximation and generalization bounds as well as the activation/loss recommendations. No justification, comparison to fixed-task ICL sampling, or empirical check is supplied, rendering the derived rates and design guidance dependent on an unverified modeling choice.
- [Main generalization theorem (the convergence-rate statement)] The dimension-independent convergence rate is stated to hold under the regularity tradeoff; however, the precise dependence on the regularity parameters of the data distribution versus the latent features is not made explicit in the bound statement, so it is unclear whether the rate remains dimension-free once those parameters are allowed to vary with dimension.
minor comments (1)
- [Preliminaries] Notation for the response function and the induced mapping from context distributions is introduced without an explicit comparison table to the standard attention formulation; a short side-by-side would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below and outline the revisions we will make.
Referee: [Section 3 (two-staged sampling definition) and the statements of the main generalization theorem] The central modeling assumption—that the two-staged sampling process (latent features drawn first, then context conditioned on them) reproduces the distribution of context-response pairs arising in standard ICL—is invoked to obtain both the approximation and generalization bounds as well as the activation/loss recommendations. No justification, comparison to fixed-task ICL sampling, or empirical check is supplied, rendering the derived rates and design guidance dependent on an unverified modeling choice.
Authors: The two-staged sampling is introduced to connect the ICL setting to the domain generalization literature, where a similar hierarchical process models distribution shifts between contexts. We agree that the manuscript would benefit from explicit motivation. In revision we will expand Section 3 with a dedicated paragraph motivating the choice, citing relevant domain-generalization works that employ analogous two-stage sampling, and providing a brief comparison to the fixed-task ICL sampling used in prior transformer analyses. As the contribution is primarily theoretical, we will flag empirical validation of the modeling assumption as future work rather than adding new experiments.
Revision: partial
Referee: [Main generalization theorem (the convergence-rate statement)] The dimension-independent convergence rate is stated to hold under the regularity tradeoff; however, the precise dependence on the regularity parameters of the data distribution versus the latent features is not made explicit in the bound statement, so it is unclear whether the rate remains dimension-free once those parameters are allowed to vary with dimension.
Authors: We agree that the dependence should be stated explicitly. The dimension-free rate holds when the regularity parameters (smoothness, boundedness, etc.) of both the data distributions and the latent features remain independent of dimension and satisfy the stated tradeoff. We will revise the main theorem statement to include this explicit condition, thereby clarifying that the convergence rate is dimension-independent precisely under the assumption that these parameters do not scale with dimension.
Revision: yes
Circularity Check
No circularity: modeling choice followed by independent analysis
The paper adopts a two-staged sampling process drawn from domain generalization as the framework for analyzing linear transformer ICL, then derives a mapping from context distributions to response functions plus dimension-independent convergence rates under that model. No equations are supplied in the abstract or visible text that reduce the claimed mapping or rates to fitted parameters or self-definitions by construction. No self-citation chains, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation appear in the provided material. The derivation therefore remains self-contained against external benchmarks once the modeling assumption is granted; the skeptic concern targets the realism of the assumption itself rather than any internal reduction of the claimed results to their inputs.