Pith · machine review for the scientific record

arxiv: 2605.10466 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition


Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3

classification 💻 cs.LG
keywords: self-attention · in-context learning · repetitive generation · covariance readout · softmax attention · mode collapse · transformer limit · ergodic inputs

The pith

Under stationary ergodic elliptical inputs, softmax self-attention converges almost surely to a linear readout of the input covariance matrix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives that self-attention in transformers, when context grows long, stops tracking individual tokens and instead outputs a simple linear function of the second-moment statistics of the entire sequence. This single mechanism accounts for why models can perform in-context regression by effectively running gradient steps on population covariance and why long generations drift toward repetitive, Markovian behavior. A reader sees both in-context learning and mode collapse as direct consequences of the same asymptotic simplification rather than separate quirks of training or scale.

Core claim

Under the assumptions of stationarity, ergodicity, and ellipticity, the softmax attention output converges almost surely to Θ_V Σ Θ_K^T Θ_Q x_t, where Σ denotes the covariance of the input sequence. Consequently, a single attention head realizes one population gradient step for in-context linear regression; residual stacking of heads iterates the update. Across L layers, the terminal representation converges at rate 1/t to a deterministic function of the current token alone, turning autoregressive sampling into a first-order Markov chain whose attractors explain repetitive generation.
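The in-context-learning consequence can be made concrete with a short derivation. This is a sketch consistent with the stated claim, not the paper's exact construction; the identification of projections is illustrative. For in-context linear regression with zero-mean covariates, one population gradient step on the squared loss reads:

```latex
% Population loss and its gradient for in-context linear regression
L(w) = \tfrac{1}{2}\,\mathbb{E}\!\left[(y - w^{\top}x)^{2}\right],
\qquad
\nabla L(w) = \Sigma_{x}\, w - \mathbb{E}[x\,y].

% One gradient step with learning rate \eta
w^{+} = w - \eta\left(\Sigma_{x}\, w - \mathbb{E}[x\,y]\right).
```

Both Σ_x w and E[xy] are linear readouts of the joint second-moment matrix of stacked (x, y) tokens, so a limit of the form Θ_V Σ Θ_K^T Θ_Q x_t, which is linear in Σ, can realize the update for suitable projection matrices; a residual connection then adds the increment back onto the running iterate, which is what makes stacking equivalent to iterating the step.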

What carries the argument

Covariance readout: the almost-sure limit of softmax attention output to the form Θ_V Σ Θ_K^T Θ_Q x_t that discards token-level detail in favor of second-order input statistics.

If this is right

  • A single softmax head implements one step of population gradient descent on in-context linear regression.
  • Residual stacking of such heads iterates the gradient update over multiple steps.
  • In an L-layer stack the final hidden state converges to a deterministic function of the current token at rate 1/t.
  • Autoregressive generation collapses to a first-order Markov chain whose attracting orbits produce repetition and mode collapse.
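The last consequence can be illustrated with a toy, not taken from the paper: once next-token probabilities depend only on the current token, greedy decoding follows a deterministic map on a finite vocabulary, and every trajectory must enter a cycle, which is exactly an attracting repetitive orbit. The transition matrix below is a hypothetical stand-in for the paper's limiting Markov kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
V = 8  # toy vocabulary size

# Hypothetical first-order transition matrix (rows: current token, cols: next token).
P = rng.dirichlet(np.ones(V), size=V)

def greedy_generate(start, steps):
    """Greedy decoding under a first-order Markov model: x_{t+1} = argmax_j P[x_t, j]."""
    seq = [start]
    for _ in range(steps):
        seq.append(int(np.argmax(P[seq[-1]])))
    return seq

seq = greedy_generate(start=0, steps=40)
print(seq)  # the tail of the sequence repeats with some period at most V
```

A deterministic map on V states enters a cycle within V steps, so the tail of any sufficiently long greedy generation is periodic; sampling with temperature blurs but does not remove the attracting structure.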

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests attention layers are structurally biased toward second-order moments, which may limit their ability to capture higher-order dependencies even in very long contexts.
  • Architectural changes that preserve token-level detail rather than converging to covariance could reduce unwanted repetition without harming in-context learning.
  • The unification predicts a trade-off: stronger covariance readout improves ICL consistency but accelerates drift into repetitive orbits.

Load-bearing premise

The input sequence must be stationary, ergodic, and elliptically distributed.

What would settle it

Generate long stationary ergodic elliptical sequences, compute the empirical attention output, and check whether it approaches the predicted matrix expression Θ_V Σ Θ_K^T Θ_Q x_t within sampling error; divergence on such data would falsify the claim.
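A minimal version of that check can be sketched as follows. All matrices and the scale-mixture construction here are assumptions for illustration (the paper's exact scaling conventions and dimensions are not reproduced), so how fast the similarity approaches 1 at a given context length is not guaranteed by this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 4, 50_000  # embedding dimension, context length (illustrative choices)

# Hypothetical projection matrices (not from the paper).
Theta_Q = rng.normal(size=(d, d)) / np.sqrt(d)
Theta_K = rng.normal(size=(d, d)) / np.sqrt(d)
Theta_V = rng.normal(size=(d, d)) / np.sqrt(d)

# Stationary, ergodic, elliptical input: a Gaussian scale mixture,
# x_i = s_i * z_i with z_i ~ N(0, C) and s_i a positive random scalar.
A = rng.normal(size=(d, d)) / np.sqrt(d)
C = A @ A.T                                   # shape matrix of the elliptical family
z = rng.multivariate_normal(np.zeros(d), C, size=t)
s = rng.uniform(0.5, 1.5, size=(t, 1))
X = s * z                                     # rows are tokens x_1 .. x_t
Sigma = np.cov(X, rowvar=False, bias=True)    # empirical covariance (≈ population Σ)

x_t = X[-1]                                   # current query token

# Empirical softmax attention output for the query over the full context.
scores = (Theta_Q @ x_t) @ (X @ Theta_K.T).T / np.sqrt(d)
w = np.exp(scores - scores.max())
w /= w.sum()
attn_out = Theta_V @ (X.T @ w)

# Predicted long-context limit: Θ_V Σ Θ_K^T Θ_Q x_t.
pred = Theta_V @ Sigma @ Theta_K.T @ Theta_Q @ x_t

cos = attn_out @ pred / (np.linalg.norm(attn_out) * np.linalg.norm(pred))
print(f"cosine similarity between empirical and predicted output: {cos:.4f}")
```

Persistent misalignment on sequences that satisfy the stationarity, ergodicity, and ellipticity assumptions, across growing t, would falsify the claim; misalignment on data violating those assumptions would not.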

Figures

Figures reproduced from arXiv: 2605.10466 by Guanhua Fang, Haoren Xu.

Figure 1. Convergence of the joint-covariance readout for in-context linear regression. Panels (a)–(b) [caption truncated in extraction]
Figure 2. Covariance-readout alignment under elliptical inputs. Each panel shows the cosine similarity [caption truncated in extraction]
Original abstract

Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this “summarisation and forgetting” can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to $\Theta_V\Sigma\Theta_K^{\top}\Theta_Q x_t$, where $\Sigma$ is the input covariance; the long-context limit is therefore a linear readout of the input's second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this update and implements multiple gradient descent steps. (ii) Propagated across an $L$-layer transformer, this readout drives the terminal hidden state at the parametric $1/t$ rate to a deterministic function of the current token alone, so that autoregressive generation collapses asymptotically to a first-order Markov chain whose attracting orbits furnish a structural account of repetition and mode collapse. The two phenomena thus emerge as facets of a single covariance-readout principle.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript claims that, under the assumptions of stationary, ergodic, and elliptical input token distributions, the output of a softmax self-attention layer converges almost surely to the linear covariance readout Θ_V Σ Θ_K^T Θ_Q x_t (where Σ denotes the population covariance of the inputs). This limiting form is then used to show that a single attention head can realize one step of population gradient descent on in-context linear regression, that stacked residual heads iterate this update, and that propagation through an L-layer transformer drives the terminal hidden state to a deterministic function of the current token at rate 1/t, implying that autoregressive generation asymptotically collapses to a first-order Markov chain whose attractors explain repetition and mode collapse.

Significance. If the almost-sure convergence result is correct, the work supplies a single mechanistic derivation for two seemingly unrelated LLM behaviors (in-context learning and repetitive generation) directly from the attention operator, without additional learned parameters or external memory. The explicit use of the ergodic theorem followed by a continuous-mapping argument through the softmax, together with the concrete identification of the limiting linear form, constitutes a clear technical contribution. The paper also supplies falsifiable predictions (e.g., the 1/t decay of hidden-state variance and the Markovian character of long generations) that can be checked empirically.

minor comments (4)
  1. [§3.1, Eq. (8)] The statement that the elliptical condition “ensures that only second moments appear” would be strengthened by an explicit reference to the relevant property of elliptical distributions (e.g., that the conditional expectation depends only on the covariance).
  2. [§4.2] The claim that residual stacking “implements multiple gradient-descent steps” is stated without a precise induction or contraction-mapping argument showing that the composite map remains a contraction for finite L; a short lemma would remove the ambiguity.
  3. [Figure 2] The plotted trajectories are described as “empirical confirmation,” yet the caption states neither the number of independent runs, the precise input distribution used, nor error bars; this makes it difficult to assess the variability of the observed 1/t decay.
  4. [§5.3] The discussion of mode collapse would benefit from a brief comparison with existing analyses of repetition in autoregressive models (e.g., those based on temperature or nucleus sampling) to clarify the novel contribution of the covariance-readout view.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive review. We are pleased that the technical contribution of the almost-sure convergence result, its unification of in-context learning and repetitive generation, and the falsifiable predictions are recognized. No major comments were raised, so we will proceed with minor revisions to address any editorial or presentational suggestions.

Circularity Check

0 steps flagged

No significant circularity; derivation relies on ergodic theorem and continuous mapping under explicit assumptions

full rationale

The central claim derives the almost-sure convergence of the softmax attention output to Θ_V Σ Θ_K^T Θ_Q x_t from the assumptions of stationarity, ergodicity, and ellipticity by applying the ergodic theorem to the empirical second-moment matrix and then invoking continuous mapping through the softmax. This is a standard probabilistic argument that does not reduce any prediction or limit to a fitted parameter, self-definition, or self-citation chain. The elliptical condition is used only to restrict the limit to second moments, which is consistent with the stated result and does not create a definitional loop. No load-bearing steps collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about the input process and the standard softmax attention definition; no free parameters are fitted within the derivation itself.

axioms (1)
  • domain assumption Input sequence is stationary, ergodic, and elliptically distributed
    Invoked to obtain the almost-sure convergence of the attention output to the covariance readout.

pith-pipeline@v0.9.0 · 5534 in / 1279 out tokens · 111934 ms · 2026-05-12T03:56:14.755602+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors


  [17] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.

  [18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

  [19] Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891, 2022.

  [20] Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4005–4019, 2023.

  [21] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021.

  [22] Zihao Fu, Wai Lam, Anthony Man-Cho So, and Bei Shi. A theoretical analysis of the repetition problem in text generation, 2021. URL https://arxiv.org/abs/2012.14660.

  [23] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

  [24] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.

  [25] Emmanuel Rio et al. Asymptotic theory of weakly dependent random processes, volume 80. Springer, 2017.

  [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  [27] Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR, 2023.

  [28] Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning. Advances in Neural Information Processing Systems, 36:15614–15638, 2023.

  [29] Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019.

  [30] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. arXiv preprint arXiv:2111.02080, 2021.

  [31] Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. Advances in Neural Information Processing Systems, 35:3082–3095, 2022.

  [32] Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1–55, 2024.