Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

Diya Sreedhar; Juliana Li

arxiv: 2606.26050 · v1 · pith:53Z6YBKSnew · submitted 2026-06-24 · 💻 cs.LG · cond-mat.dis-nn· cs.AI· cs.CL

Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

Juliana Li , Diya Sreedhar This is my paper

Pith reviewed 2026-06-25 19:02 UTC · model grok-4.3

classification 💻 cs.LG cond-mat.dis-nncs.AIcs.CL

keywords natural ungrokkingpretraining dynamicsrule survivalsupport frequencyasymmetric controllanguage model forgettingdisplacement

0 comments

The pith

Support frequency in the training stream decides which rules survive pretraining, and control is asymmetric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Midway through pretraining a language model can acquire a rule such as pronoun-gender agreement and generalize on held-out probes, only for the same rule to collapse to near-zero performance later even though its evidence remains in the data. The paper shows that a single corpus statistic predicts which rules persist: the frequency with which the training stream presents the rule winning against alternatives. Across multiple corpora, training budgets, and random seeds, higher support frequency sustains the rule while lower support leads to displacement by a competing surface pattern. The data-to-parameter ratio affects only how far a losing rule falls, not whether it survives. Interventions confirm asymmetry: flipping support to counter-evidence reliably kills the rule in a dose-dependent way, yet restoring support up to hundreds of times the natural level produces no recovery.

Core claim

The paper establishes that natural ungrokking occurs when a learned rule collapses during continued pretraining despite persistent evidence in the corpus, and that survival is governed by support frequency—the rate at which the training stream shows the rule winning. This frequency determines the rule's fate across un-intervened runs, while the forgetting itself is a displacement in which a competing pattern's log-probability margin crosses zero near the behavioral collapse. Control is asymmetric: the same edit that introduces counter-evidence destroys the rule monotonically, but injecting support back fails to restore it even at 450 times the sustaining level. The pattern also appears in pu

What carries the argument

support frequency: the proportion of training examples in which a given rule wins against competitors, which determines whether the rule is retained or displaced.

If this is right

The data-to-parameter ratio modulates collapse depth for doomed rules but does not alter which rules survive.
The same emerge-then-collapse dynamics appear in Pythia checkpoints with depth ordered by model scale.
Introducing counter-evidence kills a rule with monotone dose-response across unrelated rules.
Injecting support fails to recover a collapsed rule even at 450 times the natural sustaining level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Curating training streams to raise support frequency for desired rules could selectively preserve specific behaviors without changing model scale.
The same frequency-based competition may explain why some facts persist while others are overwritten in larger web-trained models.
The asymmetry suggests prevention of unwanted rule acquisition during training is more tractable than post-hoc correction.

Load-bearing premise

The observed collapse on held-out probes reflects genuine displacement of the rule by a competing pattern rather than changes in probe sensitivity or other unmeasured dynamics.

What would settle it

A rule whose support frequency exceeds that of its competitor still collapses on probes, or high-level support injection after collapse restores probe performance above baseline.

Figures

Figures reproduced from arXiv: 2606.26050 by Diya Sreedhar, Juliana Li.

**Figure 1.** Figure 1: The focal pronoun-gender rule emerges and then collapses under web pretraining, and an internal margin crosses zero as it does. (a) Held-out conflict accuracy. On TinyStories (green: seed mean, min–max band) the rule survives at ceiling; on web the bold vermillion trace is the seed the abstract quotes (0.94 by step 925, gone by the end), with the two faint traces rising and collapsing the same way. Dotted … view at source ↗

**Figure 3.** Figure 3: The transient signature in public checkpoints, same frozen battery: held-out conflict accuracy for the focal rule across Pythia revisions (70M–1.4B) and OLMo-1B, frozen k=3 smoothing, log-axis steps. The rule emerges early in every model; end-oftraining survival is ordered by scale (PUB3, Spearman ρ = 0.894), the smallest model falling deepest and collapse gone by 410M. Agree controls of the collapsing m… view at source ↗

**Figure 4.** Figure 4: The two intervention directions, each on its own dose axis with the un-intervened base at dose zero. Top strips: per-seed behavioral outcome (colors as in [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: The second mechanism measure. (a) Exact decomposition of the pre-softcap she–he gap on the conflict prefixes, per web seed: the contextual contribution Cctx (vermillion) carries the margin and ends negative, the direct embedding-path contribution Cdir (grey) stays near zero throughout; vertical line = behavioral collapse onset. The measure requires no peak-height gate, so all three seeds are shown at full … view at source ↗

read the original abstract

Midway through an ordinary pretraining run, a small language model learns the pronoun-gender rule: cued with a girl's name ("Sue cried because"), it resolves the next pronoun to she, generalizing to held-out probes (0.94 by step 925). By step 3,500 the same model scores near zero on the same probes, although the rule's evidence is still in the training data. We call this within-run reversal natural ungrokking: the corpus decides, with no trace in the loss curve, which learned rules a model keeps. Which rules survive is predictable from one corpus statistic: how often the training stream shows the rule winning. Across un-intervened runs (two corpora, three budgets, three seeds), support frequency decides a rule's fate; the data-to-parameter ratio only modulates how deeply a doomed rule falls. The same emerge-then-collapse dynamics appear in public Pythia checkpoints, collapse depth ordered by model scale as predicted. The forgetting is a displacement: a competing surface pattern out-competes the rule, and the log-probability margin between them crosses zero within 100 training steps of the behavioral collapse. Control over this fate is asymmetric: the same edit that destroys a rule on demand cannot restore it. Flipping support to counter-evidence in place kills the rule with monotone dose-response in two unrelated rules; but injecting support back, even to 450 times the level that naturally sustains it, buys no recovery. Every confirmatory threshold and prediction was pre-registered before the data it governed was read.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reports a within-run reversal in rule acquisition during pretraining that tracks a simple corpus frequency statistic, with asymmetric intervention results, but the collapse interpretation rests on behavioral probes whose sensitivity is not fully controlled.

read the letter

The main new observation is that a model can acquire a pronoun-gender rule early in pretraining, then lose it on held-out probes while the supporting data remains in the stream; this reversal is predictable from how often the rule wins in the training data, and the same pattern appears in Pythia checkpoints ordered by scale. The work also shows that artificially flipping support frequency destroys the rule in a dose-dependent way, but boosting support back does not restore it. Those two results, plus the pre-registration of thresholds, are the concrete contributions.

The experiments run across two corpora, three budgets, three seeds, and include the public checkpoints, which is more than usual for this kind of claim. The asymmetry result is the part that would actually change how people think about editing training dynamics.

The soft spot is the interpretation of the behavioral collapse. The abstract states that the rule's evidence stays in the data, yet the drop to near-zero on probes could come from probe formatting drift, representation shift, or contamination rather than genuine displacement by a competing pattern. No re-test on the original training instances or ablation of probe wording is described in the provided text, so the central claim that support frequency decides the rule's fate still needs that check. The log-prob margin crossing zero is reported, but without those controls it remains one possible reading.

This is the kind of paper a reading group could discuss for the experimental design and the asymmetry finding, even if the displacement story needs tightening. It is worth sending to referees because the pre-registered predictions and the scale replication give it enough grounding to be worth a careful review, though the authors should expect questions on the probe validity.

Referee Report

2 major / 2 minor

Summary. The manuscript reports the phenomenon of 'natural ungrokking' during language model pretraining: models learn rules such as pronoun-gender resolution (reaching 0.94 accuracy on held-out probes) that then collapse to near-zero despite the rule's evidence remaining in the training data. Survival of rules is claimed to be predictable from a single independent corpus statistic (support frequency, i.e., how often the training stream shows the rule winning), with the data-to-parameter ratio only modulating collapse depth; this is shown across two corpora, three budgets, three seeds, replicated on public Pythia checkpoints, and with asymmetric control via support interventions. All confirmatory thresholds and predictions were pre-registered.

Significance. If the displacement interpretation holds, the work identifies a corpus-statistic predictor for rule persistence that operates without loss-curve signatures and demonstrates asymmetric editability of learned rules. Credit is due for the pre-registered predictions, consistent results across un-intervened runs and public checkpoints, and the explicit measurement of log-probability margins crossing zero near collapse.

major comments (2)

[Abstract] Abstract: the central claim that behavioral collapse on held-out probes reflects genuine displacement (rather than probe sensitivity drift or representation shift) is load-bearing for both the support-frequency predictability result and the asymmetric-control findings. The abstract asserts that 'the rule's evidence remains in the training data' and that 'the log-probability margin between them crosses zero within 100 training steps,' yet provides no explicit controls such as re-testing on original training instances containing the evidence or ablation of probe formatting to rule out changes in probe sensitivity.
[Intervention experiments] Intervention results: the monotone dose-response for rule destruction via counter-evidence injection is presented as decisive, but the manuscript does not report whether the edited support frequency was measured on the same independent corpus statistic used for the un-intervened runs, leaving open whether the asymmetry result could be an artifact of how the interventions alter the overall training distribution.

minor comments (2)

[Abstract] The abstract states results are consistent across 'two corpora, three budgets, three seeds' but does not specify the exact statistical test or threshold used for 'decides a rule's fate'; this detail belongs in the main text or a methods appendix.
[Pythia replication section] Replication on Pythia checkpoints is a strength, but the manuscript should clarify whether the support-frequency statistic was computed from the original training stream or approximated from the released data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the need for explicit validation of the displacement claim and intervention measurements. We address each major comment below and commit to revisions where the manuscript requires clarification or additional reporting.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that behavioral collapse on held-out probes reflects genuine displacement (rather than probe sensitivity drift or representation shift) is load-bearing for both the support-frequency predictability result and the asymmetric-control findings. The abstract asserts that 'the rule's evidence remains in the training data' and that 'the log-probability margin between them crosses zero within 100 training steps,' yet provides no explicit controls such as re-testing on original training instances containing the evidence or ablation of probe formatting to rule out changes in probe sensitivity.

Authors: We agree the abstract is concise and does not enumerate controls. The main text supports the displacement interpretation via direct measurement of the log-probability margin between the rule and competing surface pattern, which crosses zero near collapse, and by confirming the rule's evidence persists in the corpus. However, the manuscript does not report re-testing on original training instances or probe-formatting ablations. We will add these controls (re-testing on evidence-containing instances and probe ablations) to the revised manuscript and update the abstract to reference them. This strengthens the claim without altering the pre-registered predictions. revision: yes
Referee: [Intervention experiments] Intervention results: the monotone dose-response for rule destruction via counter-evidence injection is presented as decisive, but the manuscript does not report whether the edited support frequency was measured on the same independent corpus statistic used for the un-intervened runs, leaving open whether the asymmetry result could be an artifact of how the interventions alter the overall training distribution.

Authors: The interventions were constructed to modify support frequency using the identical definition and independent corpus statistic applied to the un-intervened runs. Post-edit support frequencies were verified on that same statistic to confirm the target levels were achieved. The asymmetry (destruction succeeds; restoration fails even at 450x natural support) is therefore tied to the same predictor. The manuscript does not explicitly state this verification step for the edited runs. We will add a dedicated paragraph in the methods and results sections reporting the post-intervention measurements on the independent corpus to eliminate any ambiguity. revision: yes

Circularity Check

0 steps flagged

No circularity: support frequency measured independently from corpus; survival outcome not used to define or fit it.

full rationale

The paper's central claim is that rule survival is predicted by an independently measured corpus statistic (support frequency, i.e., how often the training stream shows the rule winning). This statistic is extracted directly from the training data stream before observing outcomes. Behavioral collapse on probes is treated as the dependent variable, not fed back into the statistic. Pre-registration of thresholds and predictions is stated explicitly. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described chain. The derivation remains self-contained against external benchmarks (multiple corpora, seeds, Pythia checkpoints).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; no explicit free parameters or invented entities are stated. The central claim rests on the domain assumption that held-out probes measure rule retention independently of the frequency statistic.

axioms (1)

domain assumption Held-out behavioral probes accurately reflect whether the model has retained the pronoun-gender rule.
The abstract uses probe accuracy (0.94 then near zero) as the measure of rule presence and absence.

pith-pipeline@v0.9.1-grok · 5821 in / 1329 out tokens · 38673 ms · 2026-06-25T19:02:00.100483+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 6 canonical work pages

[1]

Critical learning periods in deep networks

Achille, A., Rovere, M., and Soatto, S. Critical learning periods in deep networks. In International Conference on Learning Representations, 2019. arXiv:1711.08856

Pith/arXiv arXiv 2019
[2]

A., Purohit, S., Prashanth, U

Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023. arXiv:2304.01373

Pith/arXiv arXiv 2023
[3]

C., Xu, P., Och, F

Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp.\ 858--867, 2007

2007
[4]

Chang, T. A. and Bergen, B. K. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10: 0 1--16, 2022. doi:10.1162/tacl_a_00444

work page doi:10.1162/tacl_a_00444 2022
[5]

and Tu, Zhuowen and Bergen, Benjamin K

Chang, T. A., Tu, Z., and Bergen, B. K. Characterizing learning curves during language model pre-training: Learning, forgetting, and stability. Transactions of the Association for Computational Linguistics, 12: 0 1346--1362, 2024. doi:10.1162/tacl_a_00708

work page doi:10.1162/tacl_a_00708 2024
[6]

L., and Saphra, N

Chen, A., Shwartz-Ziv, R., Cho, K., Leavitt, M. L., and Saphra, N. Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in mlms. In International Conference on Learning Representations, 2024. arXiv:2309.07311

arXiv 2024
[7]

The grammar-learning trajectories of neural language models

Choshen, L., Hacohen, G., Weinshall, D., and Abend, O. The grammar-learning trajectories of neural language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022

2022
[8]

Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training

Diao, S., Yang, Y., Fu, Y., Dong, X., Su, D., Kliegl, M., Chen, Z., Belcak, P., Suhara, Y., Yin, H., Patwary, M., Lin, Y., Kautz, J., and Molchanov, P. Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training. arXiv preprint arXiv:2504.13161, 2025

Pith/arXiv arXiv 2025
[9]

and Li, Y

Eldan, R. and Li, Y. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759, 2023

Pith/arXiv arXiv 2023
[10]

A mathematical framework for transformer circuits

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html

2021
[11]

G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram \'e , A., et al

Gemma Team , Rivi \`e re, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram \'e , A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

Pith/arXiv arXiv 2024
[12]

H., Ivison, H., Magnusson, I., Wang, Y., et al

Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I., Wang, Y., et al. Olmo: Accelerating the science of language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. arXiv:2402.00838

Pith/arXiv arXiv 2024
[13]

A., Welbl, J., Clark, A., et al

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

Pith/arXiv arXiv 2022
[14]

The developmental landscape of in-context learning

Hoogland, J., Wang, G., Farrugia-Roberts, M., Carroll, L., Wei, S., and Murfet, D. The developmental landscape of in-context learning. arXiv preprint arXiv:2402.02364, 2024

arXiv 2024
[15]

Measuring forgetting of memorized training examples

Jagielski, M., Thakkar, O., Tram \`e r, F., Ippolito, D., Lee, K., Carlini, N., Wallace, E., Song, S., Thakurta, A., Papernot, N., and Zhang, C. Measuring forgetting of memorized training examples. arXiv preprint arXiv:2207.00099, 2022

arXiv 2022
[16]

Muon: An optimizer for hidden layers in neural networks

Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., and Bernstein, J. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/, 2024

2024
[17]

Large language models struggle to learn long-tail knowledge

Kandpal, N., Deng, H., Roberts, A., Wallace, E., and Raffel, C. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, 2023

2023
[18]

nanochat

Karpathy, A. nanochat. https://github.com/karpathy/nanochat, 2025

2025
[19]

A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114 0 (13): 0 3521--3526, 2017. arXiv:1612.00796

arXiv 2017
[20]

Sparse crosscoders for cross-layer features and model diffing

Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J., and Olah, C. Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/crosscoders/index.html

2024
[21]

J., Tegmark, M., and Williams, M

Liu, Z., Kitouni, O., Nolte, N., Michaud, E. J., Tegmark, M., and Williams, M. Towards understanding grokking: An effective theory of representation learning. In Advances in Neural Information Processing Systems, 2022

2022
[22]

J., and Tegmark, M

Liu, Z., Michaud, E. J., and Tegmark, M. Omnigrok: Grokking beyond algorithmic data. In International Conference on Learning Representations, 2023. arXiv:2210.01117

arXiv 2023
[23]

and Hutter, F

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

2019
[24]

Gender bias in neural natural language processing

Lu, K., Mardziel, P., Wu, F., Amancharla, P., and Datta, A. Gender bias in neural natural language processing. In Logic, Language, and Security. Springer, 2020

2020
[25]

Locating and editing factual associations in GPT

Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT . In Advances in Neural Information Processing Systems, 2022

2022
[26]

and Mahowald, K

Misra, K. and Mahowald, K. Language models learn rare phenomena from less rare phenomena: The case of the missing AANNs . arXiv preprint arXiv:2403.19827, 2024

arXiv 2024
[27]

M., Barak, B., Le Scao, T., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C

Muennighoff, N., Rush, A. M., Barak, B., Le Scao, T., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling data-constrained language models. In Advances in Neural Information Processing Systems, 2023. arXiv:2305.16264

Pith/arXiv arXiv 2023
[28]

Grokking of Hierarchical Structure in Vanilla Transformers

Murty, S., Sharma, P., Andreas, J., and Manning, C. Grokking of hierarchical structure in vanilla transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 439--448, 2023. doi:10.18653/v1/2023.acl-short.38

work page doi:10.18653/v1/2023.acl-short.38 2023
[29]

Progress measures for grokking via mechanistic interpretability

Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations, 2023. arXiv:2301.05217

Pith/arXiv arXiv 2023
[30]

In-context learning and induction heads

Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., et al. In-context learning and induction heads. Transformer Circuits Thread, 2022. arXiv:2209.11895

Pith/arXiv arXiv 2022
[31]

Filtered Corpus Training ( F i CT ) Shows that Language Models Can Generalize from Indirect Evidence

Patil, A., Jumelet, J., Chiu, Y. Y., Lapastora, A., Shen, P., Wang, L., Willrich, C., and Steinert-Threlkeld, S. Filtered corpus training ( FiCT ) shows that language models can generalize from indirect evidence. Transactions of the Association for Computational Linguistics, 12: 0 1597--1615, 2024. doi:10.1162/tacl_a_00720

work page doi:10.1162/tacl_a_00720 2024
[32]

Grokking: Generalization beyond overfitting on small algorithmic datasets

Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022

Pith/arXiv arXiv 2022
[33]

Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems, 2023

Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems, 2023. arXiv:2304.15004

arXiv 2023
[34]

Neural machine translation of rare words with subword units

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016

2016
[35]

K., Chan, S

Singh, A. K., Chan, S. C., Moskovitz, T., Grant, E., Saxe, A. M., and Hill, F. The transient nature of emergent in-context learning in transformers. In Advances in Neural Information Processing Systems, 2023. arXiv:2311.08360

arXiv 2023
[36]

H., Zettlemoyer, L., and Aghajanyan, A

Tirumala, K., Markosyan, A. H., Zettlemoyer, L., and Aghajanyan, A. Memorization without overfitting: Analyzing the training dynamics of large language models. In Advances in Neural Information Processing Systems, 2022. arXiv:2205.10770

arXiv 2022
[37]

T., Trischler, A., Bengio, Y., and Gordon, G

Toneva, M., Sordoni, A., des Combes, R. T., Trischler, A., Bengio, Y., and Gordon, G. J. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019. arXiv:1812.05159

arXiv 2019
[38]

Explaining grokking through circuit efficiency

Varma, V., Shah, R., Kenton, Z., Kram \'a r, J., and Kumar, R. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390, 2023

arXiv 2023
[39]

N., Kaiser, ., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, ., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017

2017
[40]

Investigating gender bias in language models using causal mediation analysis

Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems, 2020

2020
[41]

Differentiation and specialization of attention heads via the refined local learning coefficient

Wang, G., Hoogland, J., van Wingerden, S., Furman, Z., and Murfet, D. Differentiation and specialization of attention heads via the refined local learning coefficient. arXiv preprint arXiv:2410.02984, 2024

arXiv 2024
[42]

Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., and Bowman, S. R. Blimp: The benchmark of linguistic minimal pairs for english. Transactions of the Association for Computational Linguistics, 8: 0 377--392, 2020 a . arXiv:1912.00582

arXiv 2020
[43]

Warstadt, A., Zhang, Y., Li, X., Liu, H., and Bowman, S. R. Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp.\ 217--235, 2020 b . doi:10.18653/v1/2020.emnlp-main.16

work page doi:10.18653/v1/2020.emnlp-main.16 2020
[44]

Frequency effects on syntactic rule learning in transformers

Wei, J., Garrette, D., Linzen, T., and Pavlick, E. Frequency effects on syntactic rule learning in transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 932--948, 2021. doi:10.18653/v1/2021.emnlp-main.72

work page doi:10.18653/v1/2021.emnlp-main.72 2021
[45]

H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. arXiv:2206.07682

Pith/arXiv arXiv 2022
[46]

M., Pham, H., Dong, X., Du, N., Liu, H., Lu, Y., Liang, P., Le, Q

Xie, S. M., Pham, H., Dong, X., Du, N., Liu, H., Lu, Y., Liang, P., Le, Q. V., Ma, T., and Yu, A. W. Doremi: Optimizing data mixtures speeds up language model pretraining. In Advances in Neural Information Processing Systems, 2023. arXiv:2305.10429

arXiv 2023
[47]

Gender bias in coreference resolution: Evaluation and debiasing methods

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K.-W. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of NAACL-HLT, 2018

2018
[48]

How do language models learn facts? dynamics, curricula and hallucinations

Zucchet, N., Bornschein, J., Chan, S., Lampinen, A., Pascanu, R., and De, S. How do language models learn facts? dynamics, curricula and hallucinations. arXiv preprint arXiv:2503.21676, 2025

arXiv 2025

[1] [1]

Critical learning periods in deep networks

Achille, A., Rovere, M., and Soatto, S. Critical learning periods in deep networks. In International Conference on Learning Representations, 2019. arXiv:1711.08856

Pith/arXiv arXiv 2019

[2] [2]

A., Purohit, S., Prashanth, U

Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023. arXiv:2304.01373

Pith/arXiv arXiv 2023

[3] [3]

C., Xu, P., Och, F

Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp.\ 858--867, 2007

2007

[4] [4]

Chang, T. A. and Bergen, B. K. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10: 0 1--16, 2022. doi:10.1162/tacl_a_00444

work page doi:10.1162/tacl_a_00444 2022

[5] [5]

and Tu, Zhuowen and Bergen, Benjamin K

Chang, T. A., Tu, Z., and Bergen, B. K. Characterizing learning curves during language model pre-training: Learning, forgetting, and stability. Transactions of the Association for Computational Linguistics, 12: 0 1346--1362, 2024. doi:10.1162/tacl_a_00708

work page doi:10.1162/tacl_a_00708 2024

[6] [6]

L., and Saphra, N

Chen, A., Shwartz-Ziv, R., Cho, K., Leavitt, M. L., and Saphra, N. Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in mlms. In International Conference on Learning Representations, 2024. arXiv:2309.07311

arXiv 2024

[7] [7]

The grammar-learning trajectories of neural language models

Choshen, L., Hacohen, G., Weinshall, D., and Abend, O. The grammar-learning trajectories of neural language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022

2022

[8] [8]

Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training

Diao, S., Yang, Y., Fu, Y., Dong, X., Su, D., Kliegl, M., Chen, Z., Belcak, P., Suhara, Y., Yin, H., Patwary, M., Lin, Y., Kautz, J., and Molchanov, P. Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training. arXiv preprint arXiv:2504.13161, 2025

Pith/arXiv arXiv 2025

[9] [9]

and Li, Y

Eldan, R. and Li, Y. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759, 2023

Pith/arXiv arXiv 2023

[10] [10]

A mathematical framework for transformer circuits

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html

2021

[11] [11]

G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram \'e , A., et al

Gemma Team , Rivi \`e re, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram \'e , A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

Pith/arXiv arXiv 2024

[12] [12]

H., Ivison, H., Magnusson, I., Wang, Y., et al

Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I., Wang, Y., et al. Olmo: Accelerating the science of language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. arXiv:2402.00838

Pith/arXiv arXiv 2024

[13] [13]

A., Welbl, J., Clark, A., et al

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

Pith/arXiv arXiv 2022

[14] [14]

The developmental landscape of in-context learning

Hoogland, J., Wang, G., Farrugia-Roberts, M., Carroll, L., Wei, S., and Murfet, D. The developmental landscape of in-context learning. arXiv preprint arXiv:2402.02364, 2024

arXiv 2024

[15] [15]

Measuring forgetting of memorized training examples

Jagielski, M., Thakkar, O., Tram \`e r, F., Ippolito, D., Lee, K., Carlini, N., Wallace, E., Song, S., Thakurta, A., Papernot, N., and Zhang, C. Measuring forgetting of memorized training examples. arXiv preprint arXiv:2207.00099, 2022

arXiv 2022

[16] [16]

Muon: An optimizer for hidden layers in neural networks

Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., and Bernstein, J. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/, 2024

2024

[17] [17]

Large language models struggle to learn long-tail knowledge

Kandpal, N., Deng, H., Roberts, A., Wallace, E., and Raffel, C. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, 2023

2023

[18] [18]

nanochat

Karpathy, A. nanochat. https://github.com/karpathy/nanochat, 2025

2025

[19] [19]

A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114 0 (13): 0 3521--3526, 2017. arXiv:1612.00796

arXiv 2017

[20] [20]

Sparse crosscoders for cross-layer features and model diffing

Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J., and Olah, C. Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/crosscoders/index.html

2024

[21] [21]

J., Tegmark, M., and Williams, M

Liu, Z., Kitouni, O., Nolte, N., Michaud, E. J., Tegmark, M., and Williams, M. Towards understanding grokking: An effective theory of representation learning. In Advances in Neural Information Processing Systems, 2022

2022

[22] [22]

J., and Tegmark, M

Liu, Z., Michaud, E. J., and Tegmark, M. Omnigrok: Grokking beyond algorithmic data. In International Conference on Learning Representations, 2023. arXiv:2210.01117

arXiv 2023

[23] [23]

and Hutter, F

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

2019

[24] [24]

Gender bias in neural natural language processing

Lu, K., Mardziel, P., Wu, F., Amancharla, P., and Datta, A. Gender bias in neural natural language processing. In Logic, Language, and Security. Springer, 2020

2020

[25] [25]

Locating and editing factual associations in GPT

Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT . In Advances in Neural Information Processing Systems, 2022

2022

[26] [26]

and Mahowald, K

Misra, K. and Mahowald, K. Language models learn rare phenomena from less rare phenomena: The case of the missing AANNs . arXiv preprint arXiv:2403.19827, 2024

arXiv 2024

[27] [27]

M., Barak, B., Le Scao, T., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C

Muennighoff, N., Rush, A. M., Barak, B., Le Scao, T., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling data-constrained language models. In Advances in Neural Information Processing Systems, 2023. arXiv:2305.16264

Pith/arXiv arXiv 2023

[28] [28]

Grokking of Hierarchical Structure in Vanilla Transformers

Murty, S., Sharma, P., Andreas, J., and Manning, C. Grokking of hierarchical structure in vanilla transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 439--448, 2023. doi:10.18653/v1/2023.acl-short.38

work page doi:10.18653/v1/2023.acl-short.38 2023

[29] [29]

Progress measures for grokking via mechanistic interpretability

Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations, 2023. arXiv:2301.05217

Pith/arXiv arXiv 2023

[30] [30]

In-context learning and induction heads

Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., et al. In-context learning and induction heads. Transformer Circuits Thread, 2022. arXiv:2209.11895

Pith/arXiv arXiv 2022

[31] [31]

Filtered Corpus Training ( F i CT ) Shows that Language Models Can Generalize from Indirect Evidence

Patil, A., Jumelet, J., Chiu, Y. Y., Lapastora, A., Shen, P., Wang, L., Willrich, C., and Steinert-Threlkeld, S. Filtered corpus training ( FiCT ) shows that language models can generalize from indirect evidence. Transactions of the Association for Computational Linguistics, 12: 0 1597--1615, 2024. doi:10.1162/tacl_a_00720

work page doi:10.1162/tacl_a_00720 2024

[32] [32]

Grokking: Generalization beyond overfitting on small algorithmic datasets

Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022

Pith/arXiv arXiv 2022

[33] [33]

Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems, 2023

Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems, 2023. arXiv:2304.15004

arXiv 2023

[34] [34]

Neural machine translation of rare words with subword units

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016

2016

[35] [35]

K., Chan, S

Singh, A. K., Chan, S. C., Moskovitz, T., Grant, E., Saxe, A. M., and Hill, F. The transient nature of emergent in-context learning in transformers. In Advances in Neural Information Processing Systems, 2023. arXiv:2311.08360

arXiv 2023

[36] [36]

H., Zettlemoyer, L., and Aghajanyan, A

Tirumala, K., Markosyan, A. H., Zettlemoyer, L., and Aghajanyan, A. Memorization without overfitting: Analyzing the training dynamics of large language models. In Advances in Neural Information Processing Systems, 2022. arXiv:2205.10770

arXiv 2022

[37] [37]

T., Trischler, A., Bengio, Y., and Gordon, G

Toneva, M., Sordoni, A., des Combes, R. T., Trischler, A., Bengio, Y., and Gordon, G. J. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019. arXiv:1812.05159

arXiv 2019

[38] [38]

Explaining grokking through circuit efficiency

Varma, V., Shah, R., Kenton, Z., Kram \'a r, J., and Kumar, R. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390, 2023

arXiv 2023

[39] [39]

N., Kaiser, ., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, ., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017

2017

[40] [40]

Investigating gender bias in language models using causal mediation analysis

Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems, 2020

2020

[41] [41]

Differentiation and specialization of attention heads via the refined local learning coefficient

Wang, G., Hoogland, J., van Wingerden, S., Furman, Z., and Murfet, D. Differentiation and specialization of attention heads via the refined local learning coefficient. arXiv preprint arXiv:2410.02984, 2024

arXiv 2024

[42] [42]

Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., and Bowman, S. R. Blimp: The benchmark of linguistic minimal pairs for english. Transactions of the Association for Computational Linguistics, 8: 0 377--392, 2020 a . arXiv:1912.00582

arXiv 2020

[43] [43]

Warstadt, A., Zhang, Y., Li, X., Liu, H., and Bowman, S. R. Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp.\ 217--235, 2020 b . doi:10.18653/v1/2020.emnlp-main.16

work page doi:10.18653/v1/2020.emnlp-main.16 2020

[44] [44]

Frequency effects on syntactic rule learning in transformers

Wei, J., Garrette, D., Linzen, T., and Pavlick, E. Frequency effects on syntactic rule learning in transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 932--948, 2021. doi:10.18653/v1/2021.emnlp-main.72

work page doi:10.18653/v1/2021.emnlp-main.72 2021

[45] [45]

H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. arXiv:2206.07682

Pith/arXiv arXiv 2022

[46] [46]

M., Pham, H., Dong, X., Du, N., Liu, H., Lu, Y., Liang, P., Le, Q

Xie, S. M., Pham, H., Dong, X., Du, N., Liu, H., Lu, Y., Liang, P., Le, Q. V., Ma, T., and Yu, A. W. Doremi: Optimizing data mixtures speeds up language model pretraining. In Advances in Neural Information Processing Systems, 2023. arXiv:2305.10429

arXiv 2023

[47] [47]

Gender bias in coreference resolution: Evaluation and debiasing methods

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K.-W. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of NAACL-HLT, 2018

2018

[48] [48]

How do language models learn facts? dynamics, curricula and hallucinations

Zucchet, N., Bornschein, J., Chan, S., Lampinen, A., Pascanu, R., and De, S. How do language models learn facts? dynamics, curricula and hallucinations. arXiv preprint arXiv:2503.21676, 2025

arXiv 2025