pith. sign in

arxiv: 2606.26050 · v1 · pith:53Z6YBKSnew · submitted 2026-06-24 · 💻 cs.LG · cond-mat.dis-nn· cs.AI· cs.CL

Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

Pith reviewed 2026-06-25 19:02 UTC · model grok-4.3

classification 💻 cs.LG cond-mat.dis-nncs.AIcs.CL
keywords natural ungrokkingpretraining dynamicsrule survivalsupport frequencyasymmetric controllanguage model forgettingdisplacement
0
0 comments X

The pith

Support frequency in the training stream decides which rules survive pretraining, and control is asymmetric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Midway through pretraining a language model can acquire a rule such as pronoun-gender agreement and generalize on held-out probes, only for the same rule to collapse to near-zero performance later even though its evidence remains in the data. The paper shows that a single corpus statistic predicts which rules persist: the frequency with which the training stream presents the rule winning against alternatives. Across multiple corpora, training budgets, and random seeds, higher support frequency sustains the rule while lower support leads to displacement by a competing surface pattern. The data-to-parameter ratio affects only how far a losing rule falls, not whether it survives. Interventions confirm asymmetry: flipping support to counter-evidence reliably kills the rule in a dose-dependent way, yet restoring support up to hundreds of times the natural level produces no recovery.

Core claim

The paper establishes that natural ungrokking occurs when a learned rule collapses during continued pretraining despite persistent evidence in the corpus, and that survival is governed by support frequency—the rate at which the training stream shows the rule winning. This frequency determines the rule's fate across un-intervened runs, while the forgetting itself is a displacement in which a competing pattern's log-probability margin crosses zero near the behavioral collapse. Control is asymmetric: the same edit that introduces counter-evidence destroys the rule monotonically, but injecting support back fails to restore it even at 450 times the sustaining level. The pattern also appears in pu

What carries the argument

support frequency: the proportion of training examples in which a given rule wins against competitors, which determines whether the rule is retained or displaced.

If this is right

  • The data-to-parameter ratio modulates collapse depth for doomed rules but does not alter which rules survive.
  • The same emerge-then-collapse dynamics appear in Pythia checkpoints with depth ordered by model scale.
  • Introducing counter-evidence kills a rule with monotone dose-response across unrelated rules.
  • Injecting support fails to recover a collapsed rule even at 450 times the natural sustaining level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Curating training streams to raise support frequency for desired rules could selectively preserve specific behaviors without changing model scale.
  • The same frequency-based competition may explain why some facts persist while others are overwritten in larger web-trained models.
  • The asymmetry suggests prevention of unwanted rule acquisition during training is more tractable than post-hoc correction.

Load-bearing premise

The observed collapse on held-out probes reflects genuine displacement of the rule by a competing pattern rather than changes in probe sensitivity or other unmeasured dynamics.

What would settle it

A rule whose support frequency exceeds that of its competitor still collapses on probes, or high-level support injection after collapse restores probe performance above baseline.

Figures

Figures reproduced from arXiv: 2606.26050 by Diya Sreedhar, Juliana Li.

Figure 1
Figure 1. Figure 1: The focal pronoun-gender rule emerges and then collapses under web pretraining, and an internal margin crosses zero as it does. (a) Held-out conflict accuracy. On TinyStories (green: seed mean, min–max band) the rule survives at ceiling; on web the bold vermillion trace is the seed the abstract quotes (0.94 by step 925, gone by the end), with the two faint traces rising and collapsing the same way. Dotted … view at source ↗
Figure 3
Figure 3. Figure 3: The transient signature in public checkpoints, same frozen battery: held-out conflict accuracy for the focal rule across Pythia revisions (70M–1.4B) and OLMo-1B, frozen k=3 smooth￾ing, log-axis steps. The rule emerges early in every model; end-of￾training survival is ordered by scale (PUB3, Spearman ρ = 0.894), the smallest model falling deepest and collapse gone by 410M. Agree controls of the collapsing m… view at source ↗
Figure 4
Figure 4. Figure 4: The two intervention directions, each on its own dose axis with the un-intervened base at dose zero. Top strips: per-seed behavioral outcome (colors as in [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The second mechanism measure. (a) Exact decomposition of the pre-softcap she–he gap on the conflict prefixes, per web seed: the contextual contribution Cctx (vermillion) carries the margin and ends negative, the direct embedding-path contribution Cdir (grey) stays near zero throughout; vertical line = behavioral collapse onset. The measure requires no peak-height gate, so all three seeds are shown at full … view at source ↗
read the original abstract

Midway through an ordinary pretraining run, a small language model learns the pronoun-gender rule: cued with a girl's name ("Sue cried because"), it resolves the next pronoun to she, generalizing to held-out probes (0.94 by step 925). By step 3,500 the same model scores near zero on the same probes, although the rule's evidence is still in the training data. We call this within-run reversal natural ungrokking: the corpus decides, with no trace in the loss curve, which learned rules a model keeps. Which rules survive is predictable from one corpus statistic: how often the training stream shows the rule winning. Across un-intervened runs (two corpora, three budgets, three seeds), support frequency decides a rule's fate; the data-to-parameter ratio only modulates how deeply a doomed rule falls. The same emerge-then-collapse dynamics appear in public Pythia checkpoints, collapse depth ordered by model scale as predicted. The forgetting is a displacement: a competing surface pattern out-competes the rule, and the log-probability margin between them crosses zero within 100 training steps of the behavioral collapse. Control over this fate is asymmetric: the same edit that destroys a rule on demand cannot restore it. Flipping support to counter-evidence in place kills the rule with monotone dose-response in two unrelated rules; but injecting support back, even to 450 times the level that naturally sustains it, buys no recovery. Every confirmatory threshold and prediction was pre-registered before the data it governed was read.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports the phenomenon of 'natural ungrokking' during language model pretraining: models learn rules such as pronoun-gender resolution (reaching 0.94 accuracy on held-out probes) that then collapse to near-zero despite the rule's evidence remaining in the training data. Survival of rules is claimed to be predictable from a single independent corpus statistic (support frequency, i.e., how often the training stream shows the rule winning), with the data-to-parameter ratio only modulating collapse depth; this is shown across two corpora, three budgets, three seeds, replicated on public Pythia checkpoints, and with asymmetric control via support interventions. All confirmatory thresholds and predictions were pre-registered.

Significance. If the displacement interpretation holds, the work identifies a corpus-statistic predictor for rule persistence that operates without loss-curve signatures and demonstrates asymmetric editability of learned rules. Credit is due for the pre-registered predictions, consistent results across un-intervened runs and public checkpoints, and the explicit measurement of log-probability margins crossing zero near collapse.

major comments (2)
  1. [Abstract] Abstract: the central claim that behavioral collapse on held-out probes reflects genuine displacement (rather than probe sensitivity drift or representation shift) is load-bearing for both the support-frequency predictability result and the asymmetric-control findings. The abstract asserts that 'the rule's evidence remains in the training data' and that 'the log-probability margin between them crosses zero within 100 training steps,' yet provides no explicit controls such as re-testing on original training instances containing the evidence or ablation of probe formatting to rule out changes in probe sensitivity.
  2. [Intervention experiments] Intervention results: the monotone dose-response for rule destruction via counter-evidence injection is presented as decisive, but the manuscript does not report whether the edited support frequency was measured on the same independent corpus statistic used for the un-intervened runs, leaving open whether the asymmetry result could be an artifact of how the interventions alter the overall training distribution.
minor comments (2)
  1. [Abstract] The abstract states results are consistent across 'two corpora, three budgets, three seeds' but does not specify the exact statistical test or threshold used for 'decides a rule's fate'; this detail belongs in the main text or a methods appendix.
  2. [Pythia replication section] Replication on Pythia checkpoints is a strength, but the manuscript should clarify whether the support-frequency statistic was computed from the original training stream or approximated from the released data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the need for explicit validation of the displacement claim and intervention measurements. We address each major comment below and commit to revisions where the manuscript requires clarification or additional reporting.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that behavioral collapse on held-out probes reflects genuine displacement (rather than probe sensitivity drift or representation shift) is load-bearing for both the support-frequency predictability result and the asymmetric-control findings. The abstract asserts that 'the rule's evidence remains in the training data' and that 'the log-probability margin between them crosses zero within 100 training steps,' yet provides no explicit controls such as re-testing on original training instances containing the evidence or ablation of probe formatting to rule out changes in probe sensitivity.

    Authors: We agree the abstract is concise and does not enumerate controls. The main text supports the displacement interpretation via direct measurement of the log-probability margin between the rule and competing surface pattern, which crosses zero near collapse, and by confirming the rule's evidence persists in the corpus. However, the manuscript does not report re-testing on original training instances or probe-formatting ablations. We will add these controls (re-testing on evidence-containing instances and probe ablations) to the revised manuscript and update the abstract to reference them. This strengthens the claim without altering the pre-registered predictions. revision: yes

  2. Referee: [Intervention experiments] Intervention results: the monotone dose-response for rule destruction via counter-evidence injection is presented as decisive, but the manuscript does not report whether the edited support frequency was measured on the same independent corpus statistic used for the un-intervened runs, leaving open whether the asymmetry result could be an artifact of how the interventions alter the overall training distribution.

    Authors: The interventions were constructed to modify support frequency using the identical definition and independent corpus statistic applied to the un-intervened runs. Post-edit support frequencies were verified on that same statistic to confirm the target levels were achieved. The asymmetry (destruction succeeds; restoration fails even at 450x natural support) is therefore tied to the same predictor. The manuscript does not explicitly state this verification step for the edited runs. We will add a dedicated paragraph in the methods and results sections reporting the post-intervention measurements on the independent corpus to eliminate any ambiguity. revision: yes

Circularity Check

0 steps flagged

No circularity: support frequency measured independently from corpus; survival outcome not used to define or fit it.

full rationale

The paper's central claim is that rule survival is predicted by an independently measured corpus statistic (support frequency, i.e., how often the training stream shows the rule winning). This statistic is extracted directly from the training data stream before observing outcomes. Behavioral collapse on probes is treated as the dependent variable, not fed back into the statistic. Pre-registration of thresholds and predictions is stated explicitly. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described chain. The derivation remains self-contained against external benchmarks (multiple corpora, seeds, Pythia checkpoints).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; no explicit free parameters or invented entities are stated. The central claim rests on the domain assumption that held-out probes measure rule retention independently of the frequency statistic.

axioms (1)
  • domain assumption Held-out behavioral probes accurately reflect whether the model has retained the pronoun-gender rule.
    The abstract uses probe accuracy (0.94 then near zero) as the measure of rule presence and absence.

pith-pipeline@v0.9.1-grok · 5821 in / 1329 out tokens · 38673 ms · 2026-06-25T19:02:00.100483+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 6 canonical work pages

  1. [1]

    Critical learning periods in deep networks

    Achille, A., Rovere, M., and Soatto, S. Critical learning periods in deep networks. In International Conference on Learning Representations, 2019. arXiv:1711.08856

  2. [2]

    A., Purohit, S., Prashanth, U

    Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023. arXiv:2304.01373

  3. [3]

    C., Xu, P., Och, F

    Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp.\ 858--867, 2007

  4. [4]

    Chang, T. A. and Bergen, B. K. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10: 0 1--16, 2022. doi:10.1162/tacl_a_00444

  5. [5]

    and Tu, Zhuowen and Bergen, Benjamin K

    Chang, T. A., Tu, Z., and Bergen, B. K. Characterizing learning curves during language model pre-training: Learning, forgetting, and stability. Transactions of the Association for Computational Linguistics, 12: 0 1346--1362, 2024. doi:10.1162/tacl_a_00708

  6. [6]

    L., and Saphra, N

    Chen, A., Shwartz-Ziv, R., Cho, K., Leavitt, M. L., and Saphra, N. Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in mlms. In International Conference on Learning Representations, 2024. arXiv:2309.07311

  7. [7]

    The grammar-learning trajectories of neural language models

    Choshen, L., Hacohen, G., Weinshall, D., and Abend, O. The grammar-learning trajectories of neural language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022

  8. [8]

    Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training

    Diao, S., Yang, Y., Fu, Y., Dong, X., Su, D., Kliegl, M., Chen, Z., Belcak, P., Suhara, Y., Yin, H., Patwary, M., Lin, Y., Kautz, J., and Molchanov, P. Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training. arXiv preprint arXiv:2504.13161, 2025

  9. [9]

    and Li, Y

    Eldan, R. and Li, Y. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759, 2023

  10. [10]

    A mathematical framework for transformer circuits

    Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html

  11. [11]

    G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram \'e , A., et al

    Gemma Team , Rivi \`e re, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram \'e , A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

  12. [12]

    H., Ivison, H., Magnusson, I., Wang, Y., et al

    Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I., Wang, Y., et al. Olmo: Accelerating the science of language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. arXiv:2402.00838

  13. [13]

    A., Welbl, J., Clark, A., et al

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  14. [14]

    The developmental landscape of in-context learning

    Hoogland, J., Wang, G., Farrugia-Roberts, M., Carroll, L., Wei, S., and Murfet, D. The developmental landscape of in-context learning. arXiv preprint arXiv:2402.02364, 2024

  15. [15]

    Measuring forgetting of memorized training examples

    Jagielski, M., Thakkar, O., Tram \`e r, F., Ippolito, D., Lee, K., Carlini, N., Wallace, E., Song, S., Thakurta, A., Papernot, N., and Zhang, C. Measuring forgetting of memorized training examples. arXiv preprint arXiv:2207.00099, 2022

  16. [16]

    Muon: An optimizer for hidden layers in neural networks

    Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., and Bernstein, J. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/, 2024

  17. [17]

    Large language models struggle to learn long-tail knowledge

    Kandpal, N., Deng, H., Roberts, A., Wallace, E., and Raffel, C. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, 2023

  18. [18]

    nanochat

    Karpathy, A. nanochat. https://github.com/karpathy/nanochat, 2025

  19. [19]

    A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R

    Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114 0 (13): 0 3521--3526, 2017. arXiv:1612.00796

  20. [20]

    Sparse crosscoders for cross-layer features and model diffing

    Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J., and Olah, C. Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/crosscoders/index.html

  21. [21]

    J., Tegmark, M., and Williams, M

    Liu, Z., Kitouni, O., Nolte, N., Michaud, E. J., Tegmark, M., and Williams, M. Towards understanding grokking: An effective theory of representation learning. In Advances in Neural Information Processing Systems, 2022

  22. [22]

    J., and Tegmark, M

    Liu, Z., Michaud, E. J., and Tegmark, M. Omnigrok: Grokking beyond algorithmic data. In International Conference on Learning Representations, 2023. arXiv:2210.01117

  23. [23]

    and Hutter, F

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  24. [24]

    Gender bias in neural natural language processing

    Lu, K., Mardziel, P., Wu, F., Amancharla, P., and Datta, A. Gender bias in neural natural language processing. In Logic, Language, and Security. Springer, 2020

  25. [25]

    Locating and editing factual associations in GPT

    Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT . In Advances in Neural Information Processing Systems, 2022

  26. [26]

    and Mahowald, K

    Misra, K. and Mahowald, K. Language models learn rare phenomena from less rare phenomena: The case of the missing AANNs . arXiv preprint arXiv:2403.19827, 2024

  27. [27]

    M., Barak, B., Le Scao, T., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C

    Muennighoff, N., Rush, A. M., Barak, B., Le Scao, T., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling data-constrained language models. In Advances in Neural Information Processing Systems, 2023. arXiv:2305.16264

  28. [28]

    Grokking of Hierarchical Structure in Vanilla Transformers

    Murty, S., Sharma, P., Andreas, J., and Manning, C. Grokking of hierarchical structure in vanilla transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 439--448, 2023. doi:10.18653/v1/2023.acl-short.38

  29. [29]

    Progress measures for grokking via mechanistic interpretability

    Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations, 2023. arXiv:2301.05217

  30. [30]

    In-context learning and induction heads

    Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., et al. In-context learning and induction heads. Transformer Circuits Thread, 2022. arXiv:2209.11895

  31. [31]

    Filtered Corpus Training ( F i CT ) Shows that Language Models Can Generalize from Indirect Evidence

    Patil, A., Jumelet, J., Chiu, Y. Y., Lapastora, A., Shen, P., Wang, L., Willrich, C., and Steinert-Threlkeld, S. Filtered corpus training ( FiCT ) shows that language models can generalize from indirect evidence. Transactions of the Association for Computational Linguistics, 12: 0 1597--1615, 2024. doi:10.1162/tacl_a_00720

  32. [32]

    Grokking: Generalization beyond overfitting on small algorithmic datasets

    Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022

  33. [33]

    Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems, 2023

    Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems, 2023. arXiv:2304.15004

  34. [34]

    Neural machine translation of rare words with subword units

    Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016

  35. [35]

    K., Chan, S

    Singh, A. K., Chan, S. C., Moskovitz, T., Grant, E., Saxe, A. M., and Hill, F. The transient nature of emergent in-context learning in transformers. In Advances in Neural Information Processing Systems, 2023. arXiv:2311.08360

  36. [36]

    H., Zettlemoyer, L., and Aghajanyan, A

    Tirumala, K., Markosyan, A. H., Zettlemoyer, L., and Aghajanyan, A. Memorization without overfitting: Analyzing the training dynamics of large language models. In Advances in Neural Information Processing Systems, 2022. arXiv:2205.10770

  37. [37]

    T., Trischler, A., Bengio, Y., and Gordon, G

    Toneva, M., Sordoni, A., des Combes, R. T., Trischler, A., Bengio, Y., and Gordon, G. J. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019. arXiv:1812.05159

  38. [38]

    Explaining grokking through circuit efficiency

    Varma, V., Shah, R., Kenton, Z., Kram \'a r, J., and Kumar, R. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390, 2023

  39. [39]

    N., Kaiser, ., and Polosukhin, I

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, ., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017

  40. [40]

    Investigating gender bias in language models using causal mediation analysis

    Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems, 2020

  41. [41]

    Differentiation and specialization of attention heads via the refined local learning coefficient

    Wang, G., Hoogland, J., van Wingerden, S., Furman, Z., and Murfet, D. Differentiation and specialization of attention heads via the refined local learning coefficient. arXiv preprint arXiv:2410.02984, 2024

  42. [42]

    Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., and Bowman, S. R. Blimp: The benchmark of linguistic minimal pairs for english. Transactions of the Association for Computational Linguistics, 8: 0 377--392, 2020 a . arXiv:1912.00582

  43. [43]

    Warstadt, A., Zhang, Y., Li, X., Liu, H., and Bowman, S. R. Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp.\ 217--235, 2020 b . doi:10.18653/v1/2020.emnlp-main.16

  44. [44]

    Frequency effects on syntactic rule learning in transformers

    Wei, J., Garrette, D., Linzen, T., and Pavlick, E. Frequency effects on syntactic rule learning in transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 932--948, 2021. doi:10.18653/v1/2021.emnlp-main.72

  45. [45]

    H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W

    Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. arXiv:2206.07682

  46. [46]

    M., Pham, H., Dong, X., Du, N., Liu, H., Lu, Y., Liang, P., Le, Q

    Xie, S. M., Pham, H., Dong, X., Du, N., Liu, H., Lu, Y., Liang, P., Le, Q. V., Ma, T., and Yu, A. W. Doremi: Optimizing data mixtures speeds up language model pretraining. In Advances in Neural Information Processing Systems, 2023. arXiv:2305.10429

  47. [47]

    Gender bias in coreference resolution: Evaluation and debiasing methods

    Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K.-W. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of NAACL-HLT, 2018

  48. [48]

    How do language models learn facts? dynamics, curricula and hallucinations

    Zucchet, N., Bornschein, J., Chan, S., Lampinen, A., Pascanu, R., and De, S. How do language models learn facts? dynamics, curricula and hallucinations. arXiv preprint arXiv:2503.21676, 2025