arxiv: 2606.25010 · v1 · pith:QTH2KVGYnew · submitted 2026-06-23 · 💻 cs.LG · cs.CL

Emergent Capabilities Arise Randomly from Learning Sparse Attention Patterns

Pith reviewed 2026-06-25 23:40 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords emergent capabilities · sparse attention patterns · transformer models · in-context learning · synthetic datasets · attention heads · stochastic emergence · cellular automata

The pith

Emergent capabilities arise stochastically when transformers abruptly learn task-relevant sparse attention patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that capabilities such as pattern completion and indirect object identification do not improve smoothly with scale; they appear abruptly during training, at the point where the model learns the task-relevant attention patterns. Larger models tend to acquire these patterns earlier on average, and how hard a pattern is to learn depends on its sparsity and on the context length. Experiments on synthetic linear map and cellular automata tasks show that adding attention heads improves learning efficiency while widening heads gives diminishing returns past a minimum capacity, and that MLP-Mixer outperforms a standard transformer on linear map tasks with complex attention patterns.

Core claim

Emergent capabilities arise stochastically throughout training, with larger models acquiring them earlier on average. The emergence of capabilities such as pattern completion and indirect object identification corresponds to the abrupt learning of task-relevant attention patterns. To isolate this phenomenon, the authors train transformer models on synthetic linear map and cellular automata datasets and show that the difficulty of learning attention patterns depends on context length and pattern sparsity. Scaling the number of attention heads improves learning efficiency on these tasks, while increasing the head dimension yields diminishing returns past a minimum capacity.
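
The two synthetic tasks named above can be illustrated with a minimal sketch based only on the description in the Figure 3 caption: a linear map step takes the parity of a sparse subset of the previous state, and a cellular automaton step reads a lookup table indexed by a local window. The binary alphabet, the wrap-around boundary, and every function name here (`make_sparse_rows`, `linear_map_step`, `cellular_automaton_step`, `rollout`) are assumptions for illustration, not the authors' data pipeline.

```python
import random


def make_sparse_rows(state_size, sparsity, seed=0):
    """Pick `sparsity` distinct input cells for each output cell (hypothetical construction)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(state_size), sparsity)) for _ in range(state_size)]


def linear_map_step(rows, state):
    """Each output cell is the parity (sum mod 2) of its selected input cells."""
    return [sum(state[j] for j in cols) % 2 for cols in rows]


def cellular_automaton_step(table, state, window=3):
    """Each output cell is read from `table`, indexed by the bits of its local window."""
    n, half = len(state), window // 2
    out = []
    for i in range(n):
        index = 0
        for offset in range(-half, half + 1):
            # Wrap-around boundary is an assumption, not taken from the paper.
            index = index * 2 + state[(i + offset) % n]
        out.append(table[index])
    return out


def rollout(step, state, num_steps):
    """Apply `step` repeatedly and return the list of visited states."""
    states = [state]
    for _ in range(num_steps):
        states.append(step(states[-1]))
    return states


# Small fixed examples: a 3-cell linear map and elementary rule 110.
rows = [[0, 1], [1, 2], [0, 2]]
rule_110 = [(110 >> k) & 1 for k in range(8)]
linear_states = rollout(lambda s: linear_map_step(rows, s), [1, 0, 1], 2)
automaton_states = rollout(
    lambda s: cellular_automaton_step(rule_110, s), [0, 0, 0, 1, 0, 0, 0, 0], 1
)
```

In this sketch, the input cells a model must attend to for one output cell are exactly the entries of one row of `rows` (linear map) or the cells inside the window (cellular automaton), which is what makes the required attention pattern sparse.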

What carries the argument

Abrupt learning of sparse task-relevant attention patterns during transformer training.

If this is right

  • Capabilities emerge at stochastic points in training rather than at predictable scales.
  • Larger models acquire emergent capabilities earlier because they learn the required attention patterns faster on average.
  • The difficulty of acquiring an attention pattern depends on context length and sparsity: longer contexts lengthen the loss plateau, and medium-sparsity patterns are the hardest to learn (Figure 5).
  • Increasing the number of attention heads improves learning efficiency more reliably than increasing head dimension beyond a minimum size.
  • MLP-Mixer, which has no standard attention, outperforms a transformer on linear map tasks with complex attention patterns, although it underperforms on cellular automata (Figure 7).
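
One way to make "emerges at a stochastic point" measurable is sketched below. It records, for each seed, the first logged step at which the correct-token probability crosses a threshold, then reports the mean over seeds that solved the task and the number that did not. The threshold rule and the names `emergence_step` and `summarize_seeds` are assumptions for illustration; the paper identifies the point of emergence from a sharp rise in probability and may use a different criterion. The probability values are made up.

```python
def emergence_step(steps, probs, threshold=0.5):
    """Return the first logged step whose probability reaches `threshold`, else None."""
    for step, prob in zip(steps, probs):
        if prob >= threshold:
            return step
    return None


def summarize_seeds(runs, threshold=0.5):
    """Return (mean emergence step over solved runs, number of unsolved runs)."""
    found = [emergence_step(steps, probs, threshold) for steps, probs in runs]
    solved = [step for step in found if step is not None]
    mean_step = sum(solved) / len(solved) if solved else None
    return mean_step, len(found) - len(solved)


# Illustrative curves for three seeds of one model size (values are invented).
steps = [0, 100, 200, 300]
runs = [
    (steps, [0.01, 0.02, 0.90, 0.95]),  # emerges at step 200
    (steps, [0.01, 0.01, 0.02, 0.80]),  # emerges at step 300
    (steps, [0.01, 0.01, 0.01, 0.02]),  # never emerges within the run
]
mean_step, unsolved = summarize_seeds(runs)
```

Reporting unsolved runs separately, rather than folding them into the mean, mirrors the Figure 1 convention of marking models that did not solve the task by the end of training.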

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If attention pattern discovery is the bottleneck, interventions that bias training toward sparse patterns could shift emergence to earlier points in training.
  • The stochastic timing implies that repeated runs with different random seeds are needed to map when a given capability reliably appears.
  • Synthetic tasks of this form could be used to forecast which capabilities will emerge at which model sizes before full pretraining.
  • If the mechanism generalizes, then monitoring attention pattern formation during training might serve as an early indicator of downstream capability acquisition.
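
Monitoring attention pattern formation, as the last point suggests, could be sketched as tracking per-head attention entropy across checkpoints, since the Figure 8 caption reports that loss jumps coincide with drops in attention entropy. The helpers below (`row_entropy`, `head_entropy`, `sharp_drops`) and the `min_drop` threshold are hypothetical; the attention rows are assumed to be already-normalized distributions over keys.

```python
import math


def row_entropy(weights):
    """Shannon entropy (in nats) of one attention row; zero weights contribute nothing."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def head_entropy(attention):
    """Mean entropy over query positions for one head (rows are distributions over keys)."""
    return sum(row_entropy(row) for row in attention) / len(attention)


def sharp_drops(values, min_drop):
    """Indices of checkpoints where the value fell by at least `min_drop` since the previous one."""
    return [i + 1 for i in range(len(values) - 1) if values[i] - values[i + 1] >= min_drop]


# A uniform head has maximal entropy; a head focused on one key has zero entropy.
uniform_head = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
focused_head = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
entropy_trace = [head_entropy(uniform_head), 1.37, 0.20, head_entropy(focused_head)]
flagged = sharp_drops(entropy_trace, min_drop=0.5)
```

A flagged checkpoint is only a candidate indicator; whether it precedes a capability on natural text is exactly the open question raised above.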

Load-bearing premise

The synthetic linear map and cellular automata datasets sufficiently capture the mechanisms responsible for emergence of capabilities in real language model pretraining on natural text.

What would settle it

A controlled experiment on a real language model in which a downstream capability emerges without an observable abrupt shift to the corresponding attention pattern, or in which the pattern is learned but the capability remains absent.

Figures

Figures reproduced from arXiv: 2606.25010 by Andrew Gordon Wilson, Pavel Izmailov, Shikai Qiu, Vatsal Baherwani, Zixi Chen.

Figure 1: Language model capabilities emerge randomly and abruptly due to learning core attention patterns. Left: models with different initialization seeds learn a repetition task at random points throughout training; larger models consistently learn faster on average (dashed line). Models marked with a grey X did not solve the task by the end of training. Right: the model’s correct token probability abruptly spike…
Figure 2: Random emergence across three tasks. For each task, we show from left to right: (a) the average number of steps required for a capability to emerge decreases with model scale; (b) for a single training run, the model’s probability for the correct token (highlighted in green) increases sharply in a small interval which we consider the point of emergence; (c) emergence coincides with learning interpretable a…
Figure 3: Overview of transition dynamics for linear map and cellular automata datasets. Cells highlighted in orange indicate the relevant information for computing a single cell of the next state. Linear map state transitions compute the parity of a sparse subset of the previous state for each cell. Cellular automata state transitions depend on a deterministic lookup table indexed by a local window of cells. Learni…
Figure 4: Abrupt learning on the linear map task corresponds to learning rows of the ground-truth matrix. We plot the aggregate loss curve (dashed) decomposed into loss on individual output tokens (light purple) for a 1-layer transformer. We highlight one loss jump in bold and find that it corresponds to a drop in entropy for two attention heads. We plot the averaged attention maps for these two heads at initializat…
Figure 5: Training dynamics become abrupt with longer context and medium sparsity. We average training loss over three models trained with different seeds. (a) Models can always learn very sparse or very dense attention patterns in the linear map task, but as we increase the state size medium sparsity patterns become more difficult and unlearnable by S = 32. (b) Given a fixed token batch size, increasing the state s…
Figure 6
Figure 7: MLP-Mixer outperforms a transformer on the sparse linear map task, while underperforming on cellular automata. As we increase the sparsity for the linear map task with state size S = 16, the transformer model struggles to learn the task while the MLP-Mixer achieves much lower loss. Note s = 7 corresponds to a maximally difficult attention pattern, as illustrated in Figure 5a. H = 128 heads and head dimens…
Figure 8: Loss jumps in the linear map task correspond to drops in attention entropy. In a short window of training, we see (a) the loss for a single output position drops to zero abruptly, and (b) the entropy of two attention heads drops at the same time. Upon inspecting these attention heads in …
Figure 9: Attention intervention enables near-immediate learning on the linear map and cellular automata tasks. We bias the attention scores for every attention head based on a ground-truth pattern corresponding to each task, and find that in every case the subsequent learning dynamics are no longer abrupt.
Figure 10: Context length is the dominating factor in loss plateau length. We adjust cellular automata parameters S, T, N, k, C, W and track the length of the initial loss plateau. In each grid cell, we either fix the state size to S = 64 and vary trajectory length T, or fix T = 16 and vary S. In all cases, we adjust the sample batch size B so the total number of tokens per step S · T · B remains fixed. Black lines …
Figure 11: Increasing number of layers hurts linear map performance but improves cellular automata performance. Bold lines indicate average loss over three random seeds. A 1-layer model reliably solves the linear map task, but increasing the number of layers either has no effect or dramatically worsens performance. An 8 or 12 layer model cannot solve the task after 10,000 training steps. In contrast, a single layer …
Figure 12: Alternative architectures underperform relative to transformers on both cellular automata and linear map tasks. Aside from MLP-Mixer, which only outperforms transformers on the linear map task, all other architectures we evaluate yield subpar performance on both tasks. Notably, as we increase the state size N from 16 to 32 for the linear map task, MLP-Mixer still makes reasonable progress during training …
Figure 13: Pre-pretraining improves emergence speed for in-context capabilities. (a) Prepretraining with Dyck formal languages (green) or neural cellular automata (red) elicits in-context repetition capability earlier than training from scratch (grey) across four random seeds. The uplift persists even when we account for the additional pre-pretraining tokens (shaded bars). However, PPT does not consistently improve…
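
The attention intervention described in the Figure 9 caption (biasing attention scores toward a ground-truth pattern) might look roughly like the sketch below. The caption does not say how the bias is applied, so the additive pre-softmax form, the `strength` value, and the names `softmax` and `biased_attention` are all assumptions for illustration rather than the authors' method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(scores, pattern, strength=5.0):
    """Add `strength` to every score the 0/1 `pattern` marks as relevant, then normalize.

    `scores` and `pattern` are lists of rows with shape (queries, keys).
    The additive form and the default strength are assumptions.
    """
    return [
        softmax([s + strength * p for s, p in zip(score_row, pattern_row)])
        for score_row, pattern_row in zip(scores, pattern)
    ]


# With uninformative (all-zero) scores, the bias alone concentrates attention on the pattern.
scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pattern = [[1, 0, 0], [0, 1, 1]]
weights = biased_attention(scores, pattern)
```

With `strength=0.0` the function reduces to ordinary softmax attention, so the same code path can serve as the unbiased baseline in a comparison.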
Original abstract

Neural scaling laws for transformer language models predict smooth improvements in pretraining loss with increasing parameters, but downstream capabilities such as in-context learning are known to emerge abruptly past a certain model scale. In this paper, we show that emergent capabilities arise stochastically throughout training, with larger models acquiring them earlier on average. We demonstrate that the emergence of capabilities such as pattern completion and indirect object identification corresponds to the abrupt learning of task-relevant attention patterns. To isolate this phenomenon, we train transformer models on synthetic linear map and cellular automata datasets, and we show that the difficulty of learning attention patterns depends on context length and pattern sparsity. Moreover, scaling the number of attention heads improves learning efficiency on our synthetic tasks, while increasing the head dimension yields diminishing returns past a minimum capacity. We additionally investigate architectures with alternative attention mechanisms, showing that MLP-Mixer outperforms a transformer on linear map tasks with complex attention patterns. Our findings provide a mechanistic insight into emergence, showing that downstream capabilities arise abruptly due to the intrinsic difficulty of learning sparse attention patterns in transformer models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that emergent capabilities in transformers arise stochastically throughout training (with larger models acquiring them earlier on average) because of the abrupt learning of sparse task-relevant attention patterns. This is demonstrated on two hand-designed synthetic datasets (linear maps and cellular automata) where the authors measure attention pattern acquisition timing, vary context length and sparsity, compare head count vs. head dimension, and show MLP-Mixer outperforming transformers on complex patterns. The work positions these controlled experiments as providing a mechanistic explanation for abrupt capability emergence despite smooth loss scaling in real language models.

Significance. If the reported correspondence between capability emergence and attention-pattern acquisition is robust within the synthetic regime, the work supplies a concrete, testable mechanism that could explain why downstream capabilities appear suddenly. The synthetic construction permits direct inspection of attention heads and precise timing measurements that are difficult in natural-text pretraining, and the architectural ablations (head count vs. dimension, MLP-Mixer comparison) generate clear predictions. These are genuine strengths. However, because all quantitative evidence is obtained on tasks whose inputs and targets are explicitly engineered around sparse linear or rule-based structures, the transfer of the mechanism to unstructured natural-language pretraining remains untested.

major comments (2)
  1. [Experimental sections] Experimental sections (synthetic dataset construction and results): all quantitative claims about stochastic timing, abruptness, and attention-head specialization are obtained exclusively on linear-map and cellular-automata distributions whose targets are defined by sparse maps or rules. No ablation that removes the engineered sparsity (e.g., dense random targets or unstructured sequences) is reported, so it is impossible to determine whether the observed stochastic emergence is caused by the sparsity or is an artifact of the task design. This directly affects the central mechanistic claim.
  2. [Discussion / conclusion] Discussion / conclusion: the manuscript asserts that the synthetic findings supply a mechanistic insight into emergence in language models, yet no experiments on natural text, no measurement of attention specialization during real pretraining, and no comparison of emergence timing distributions between synthetic and natural settings are provided. Without such evidence the transfer argument remains unsupported and is load-bearing for the title and abstract framing.
minor comments (1)
  1. [Abstract] The abstract asserts that emergence corresponds to the abrupt learning of attention patterns but supplies no numerical values, error bars, or statistical tests; if the full paper likewise omits these, the strength of the stochasticity and timing claims cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for acknowledging the strengths of our controlled synthetic experiments. We address the major comments point by point below, and we are prepared to make revisions where appropriate to strengthen the manuscript.

  1. Referee: [Experimental sections] Experimental sections (synthetic dataset construction and results): all quantitative claims about stochastic timing, abruptness, and attention-head specialization are obtained exclusively on linear-map and cellular-automata distributions whose targets are defined by sparse maps or rules. No ablation that removes the engineered sparsity (e.g., dense random targets or unstructured sequences) is reported, so it is impossible to determine whether the observed stochastic emergence is caused by the sparsity or is an artifact of the task design. This directly affects the central mechanistic claim.

    Authors: The synthetic tasks were deliberately constructed with sparse linear maps and rule-based cellular automata to enable direct measurement of attention pattern acquisition and to model the kind of sparse structures that may underlie emergent capabilities in more complex domains. Our central claim is that in these settings, capabilities emerge stochastically due to the difficulty of learning the sparse attention patterns. We acknowledge that without an ablation using dense targets, it is difficult to isolate sparsity as the causal factor versus other aspects of the task design. To address this, we will include additional experiments with dense random targets in the revised manuscript to compare emergence behavior. revision: yes

  2. Referee: [Discussion / conclusion] Discussion / conclusion: the manuscript asserts that the synthetic findings supply a mechanistic insight into emergence in language models, yet no experiments on natural text, no measurement of attention specialization during real pretraining, and no comparison of emergence timing distributions between synthetic and natural settings are provided. Without such evidence the transfer argument remains unsupported and is load-bearing for the title and abstract framing.

    Authors: We position the work as providing a mechanistic insight derived from controlled synthetic experiments where the relevant variables can be precisely manipulated and measured. The abstract and discussion frame this as an explanation for the phenomenon of abrupt emergence despite smooth loss curves, using the synthetic regime as a testbed. We do not present direct evidence from natural language pretraining. We will revise the discussion and abstract to emphasize that these results offer a hypothesis for the mechanism in language models, with validation on natural text left as important future work, thereby clarifying the scope of the claims. revision: partial

Circularity Check

0 steps flagged

No circularity: emergence measured independently via task accuracy and attention inspection on synthetic tasks

The paper trains transformers on explicitly constructed synthetic linear-map and cellular-automata tasks, measures capability emergence by downstream task performance, and separately inspects when task-relevant attention patterns appear. These two quantities are not defined in terms of each other; the correspondence is an empirical observation rather than a definitional identity. No equations, fitted parameters, or self-citations are shown to reduce the central claim to its inputs by construction. The synthetic construction supplies an independent testbed rather than smuggling the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the chosen synthetic tasks isolate the relevant attention-learning difficulty without introducing artifacts that differ from natural language training. No free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5723 in / 1146 out tokens · 19683 ms · 2026-06-25T23:40:57.105833+00:00 · methodology
