Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay

Andrew Gordon Wilson; Dongkyu Cho; Martin Marek; Pavel Izmailov; Rumi Chunara; Shikai Qiu

arxiv: 2605.26097 · v1 · pith:7746NDQ6new · submitted 2026-05-25 · 💻 cs.LG

Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay

Martin Marek , Dongkyu Cho , Shikai Qiu , Rumi Chunara , Pavel Izmailov , Andrew Gordon Wilson This is my paper

Pith reviewed 2026-06-29 23:13 UTC · model grok-4.3

classification 💻 cs.LG

keywords catastrophic forgettinglanguage modelsreplaycontinual learningmodel capacityself-generated datafinetuning

0 comments

The pith

Self-generated samples from a language model can serve as replay data that nearly eliminates forgetting on prior tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When a language model is finetuned on a new task, performance on earlier tasks typically drops. The paper demonstrates that samples drawn from the model's own output distribution can replace stored exemplars and largely prevent this drop. Forgetting still occurs, however, if the model was pretrained so close to its capacity limit that little room remains for new knowledge. When capacity is available, self-generated replay also removes the usual speed-versus-stability tradeoff, letting the model use high learning rates and fewer steps without overwriting earlier knowledge.

Core claim

Self-generated samples serve as effective replay data, nearly eliminating forgetting. Forgetting nonetheless persists when the model has little remaining capacity: models pretrained close to saturation cannot absorb new information without overwriting prior knowledge. When capacity is not the limiting factor, low learning rates reduce forgetting but require substantially more training steps. Replay breaks this tradeoff, enabling fast, high-learning-rate finetuning without forgetting.

What carries the argument

self-generated replay using samples drawn from the model's own training distribution

If this is right

High learning-rate finetuning becomes feasible without catastrophic forgetting once self-generated replay is used.
Pretraining saturation level directly limits how much new information a model can acquire later.
The number of training steps needed to reach a target performance can be reduced when replay removes the need for low learning rates.
Continual learning pipelines no longer require storage of original task data if the model can generate its own replay buffer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result implies that continual-learning performance will improve as pretraining leaves more unused capacity rather than pushing models to saturation.
The same self-generation approach might be tested in non-language domains where a model can also sample from an approximate training distribution.
If generation quality improves with scale, the effectiveness of self-replay may increase for larger models even when they are trained closer to capacity.

Load-bearing premise

Samples the model generates from its own distribution remain close enough to the original prior-task data that they can substitute for stored exemplars without creating a distribution shift that reintroduces forgetting.

What would settle it

If finetuning with self-generated samples still produces the same accuracy drop on held-out prior-task data as training without any replay, the claim that these samples function as effective replay would be refuted.

Figures

Figures reproduced from arXiv: 2605.26097 by Andrew Gordon Wilson, Dongkyu Cho, Martin Marek, Pavel Izmailov, Rumi Chunara, Shikai Qiu.

**Figure 1.** Figure 1: Forgetting can be mitigated by regularizing on self-generated samples, but model capacity still lower-bounds forgetting. Left: We pretrain a small transformer language model on a mix of English and Spanish text until convergence. The color of each line indicates the fraction of Spanish data. Because the model is small and trained to convergence, it does not have the capacity to achieve low loss on both lan… view at source ↗

**Figure 2.** Figure 2: Regularizing on past (replay) data prevents forgetting. An MLP is first pretrained on data on the left and then finetuned on data on the right. Without any regularization, training on new data changes the model’s predictions on the old data, substantially degrading performance. Adding a KL divergence penalty on the prior data keeps the old predictions fixed while allowing the model to fit the new data. The… view at source ↗

**Figure 3.** Figure 3: Replay mitigates forgetting of language models under shifting tasks. A small (2M) transformer language model is sequentially trained to perform add, reversal, sort, and modadd on 3-digit decimal inputs. Left: Using standard training, as the model learns a new task, its accuracy on prior tasks completely degrades. Right: By adding self-generated replay data, forgetting is entirely eliminated. These results … view at source ↗

**Figure 4.** Figure 4: tests replay in a language modeling setting. We first pretrain a transformer [12] on FineWebEdu (a general-domain pretraining corpus) [13] and then finetune it on Nemotron-CC-Math [14] to improve math performance. We compare standard finetuning, LoRA [15], and replay-based regularization. Standard finetuning achieves strong downstream performance but forgets the pretraining distribution, while LoRA learn… view at source ↗

**Figure 5.** Figure 5: Overtrained models forget more. We study the effect of model capacity by pretraining and finetuning on the English and Spanish subsets of C4 [25]. During finetuning, we sweep over different strengths of KL regularization – each run is shown as a single point on the plot. We finetune each model either until it reaches a fixed target finetuning loss ( ) or until the optimizer exceeds 100 epochs, at which poi… view at source ↗

**Figure 6.** Figure 6: Model capacity affects forgetting, but replay is more important. We pretrain models of varying sizes on a varying number of tokens of FineWeb-Edu, then finetune them on NemotronCC-Math until a fixed target loss is reached. The contour lines show pretraining loss before finetuning. The colorbar expresses pretraining loss after finetuning as a compute multiplier (a compute multiplier of 90% means that pret… view at source ↗

**Figure 7.** Figure 7: Pretraining and finetuning learning rate affect forgetting. We pretrain small (6M parameters) language models using different learning rates until they reach the target loss of 3.2 nats on English text, then finetune them using different learning rates until they reach the same loss on Spanish text. Left / Middle: Both with and without replay data (KL regularization), using a high pretraining learning rate… view at source ↗

**Figure 8.** Figure 8: Replay enables compute-efficient high-learning-rate finetuning. We study a computeefficient approach to finetuning on the English and Spanish subsets of the C4 dataset. Replay allows the model to be finetuned with a high learning rate while minimizing forgetting, reducing the number of optimization steps to reach the downstream target loss, thereby reducing wall time. 1All other experiments use one batch … view at source ↗

**Figure 9.** Figure 9: Instruction-tuned models benefit from replay of pretraining data. We finetune Llama3.2-1B-Instruct to generate Verilog code. Standard finetuning improves downstream performance at the cost of forgetting, while KL regularization almost entirely eliminates forgetting. KL regularization works equally well with both substitute and self-generated data, whereas NTP regularization is sensitive to the distributi… view at source ↗

read the original abstract

Models trained on a new task typically degrade on prior tasks, a phenomenon known as forgetting. Traditionally, mitigating forgetting has required replaying stored exemplars from prior tasks, which is often impractical. By contrast, language models can sample from their own training distribution, and we show that these self-generated samples serve as effective replay data, nearly eliminating forgetting. We find that forgetting nonetheless persists when the model has little remaining capacity: models pretrained close to saturation cannot absorb new information without overwriting prior knowledge. When capacity is not the limiting factor, low learning rates reduce forgetting but require substantially more training steps. Replay breaks this tradeoff, enabling fast, high-learning-rate finetuning without forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Self-generated replay nearly eliminates forgetting in LMs when spare capacity exists, but the paper's evidence for faithful distributional match is thin and the experiments need tighter controls.

read the letter

The core result is that sampling from the model itself can replace stored exemplars for replay and largely stops forgetting on prior tasks, except when the model is already near capacity saturation. That capacity observation is the clearest new angle here; it explains why low learning rates still help but at the cost of many more steps, and why replay removes that tradeoff.

The work does a reasonable job laying out the practical motivation and showing that the generated samples can stabilize performance in the regimes they tested. The capacity-saturation point is useful because it points to a hard limit rather than just another hyperparameter tweak.

The soft spot is the untested assumption that the self-generated distribution stays close enough to the original pretraining data. Autoregressive sampling often collapses modes or shifts length and token statistics, and without reported checks like embedding distances, token KL, or probe accuracy on the original versus generated sets, the stability could come from extra gradient steps or implicit regularization instead of true replay. The abstract gives no metrics, baselines, or controls, so the strength of the empirical claim is hard to judge from the summary alone.

This is for people working on continual learning or deployment of LMs who need practical replay alternatives. It deserves a serious referee because the problem is real and the capacity angle is worth checking, even if the current evidence on distributional fidelity looks preliminary.

Referee Report

2 major / 0 minor

Summary. The paper claims that self-generated samples from language models can serve as effective replay data to nearly eliminate catastrophic forgetting when finetuning on new tasks. It further reports that forgetting persists when models are pretrained close to saturation (limited remaining capacity), and that replay enables high learning-rate finetuning without the usual tradeoff where low learning rates reduce forgetting only at the cost of substantially more training steps.

Significance. If substantiated with rigorous controls, the result would be significant for continual learning in language models, as it offers a storage-free and privacy-preserving alternative to traditional replay. The distinction between capacity-limited and optimization-limited regimes provides a useful conceptual separation. The work would benefit from explicit credit for any ablation studies isolating replay from generic regularization effects.

major comments (2)

[Abstract] Abstract: the central claim that self-generated samples 'nearly eliminate forgetting' is load-bearing for the contribution, yet the provided text states empirical outcomes without any experimental details, metrics, baselines, or controls. This prevents evaluation of whether the observed stability arises from distributional replay or other factors.
[Abstract] Abstract (and implied results): the effectiveness of self-generated replay requires that sampled data match the original prior-task distribution sufficiently to avoid mode collapse or shift. No mention of quantification (e.g., token-level KL, embedding MMD, or probe accuracy on original vs. generated data) appears, so the no-forgetting result could reduce to extra gradient steps or regularization rather than faithful replay.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater specificity in the abstract and for explicit evidence that self-generated samples provide distributional replay rather than generic regularization. We will revise the abstract accordingly and add the requested distributional analyses in the main text.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that self-generated samples 'nearly eliminate forgetting' is load-bearing for the contribution, yet the provided text states empirical outcomes without any experimental details, metrics, baselines, or controls. This prevents evaluation of whether the observed stability arises from distributional replay or other factors.

Authors: We agree the abstract is too high-level. The full manuscript reports concrete metrics (prior-task accuracy), baselines (no-replay finetuning and stored-exemplar replay), and controls (learning-rate sweeps, capacity regimes). We will revise the abstract to include a brief statement of these elements, e.g., “measured by prior-task accuracy against no-replay and stored-replay baselines.” revision: yes
Referee: [Abstract] Abstract (and implied results): the effectiveness of self-generated replay requires that sampled data match the original prior-task distribution sufficiently to avoid mode collapse or shift. No mention of quantification (e.g., token-level KL, embedding MMD, or probe accuracy on original vs. generated data) appears, so the no-forgetting result could reduce to extra gradient steps or regularization rather than faithful replay.

Authors: The manuscript already shows that replay permits high learning rates without the usual forgetting penalty, an outcome not explained by extra steps or generic regularization alone (we equate total gradient steps across conditions). Nevertheless, we accept that explicit distributional fidelity checks are needed. In revision we will add quantification—probe accuracy on original data evaluated on generated samples and token-level distributional metrics—to confirm the generated data remain faithful to the prior-task distribution. revision: yes

Circularity Check

0 steps flagged

Empirical results on self-generated replay show no circularity in derivation chain

full rationale

The paper reports experimental findings on catastrophic forgetting in language models, demonstrating that self-generated samples can serve as effective replay to mitigate forgetting, with additional observations on capacity limits and learning rate tradeoffs. No equations, fitted parameters, or derivation steps are described that reduce a claimed prediction or result to its own inputs by construction. The central claims rest on empirical measurements rather than a mathematical chain, uniqueness theorems, or ansatzes imported via self-citation. The work is self-contained against external benchmarks via direct experiments, with no load-bearing self-citations or self-definitional constructs visible. This is the expected outcome for an empirical ML study without theoretical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.1-grok · 5654 in / 976 out tokens · 21433 ms · 2026-06-29T23:13:46.487381+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
cs.LG 2026-05 unverdicted novelty 6.0

Larger models succeed on rare and complex tasks by reducing gradient interference from common tasks, allowing rare-task features to accumulate, as shown via synthetic task mixtures and OLMo pretraining from 4M to 4B p...

Reference graph

Works this paper leans on

64 extracted references · 47 canonical work pages · cited by 1 Pith paper · 18 internal anchors

[1]

Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort

Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. Continual pre-training of large language models: How to (re)warm your model?, 2023. URLhttps://arxiv.org/abs/2308.04014

work page arXiv 2023
[2]

Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catas- trophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):352...

work page doi:10.1073/pnas.1611835114 2017
[3]

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025. URL https://arxiv.org/abs/2308.08747

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Scaling laws for forgetting during finetuning with pretraining data injection, 2025

Louis Bethune, David Grangier, Dan Busbridge, Eleonora Gualdoni, Marco Cuturi, and Pierre Ablin. Scaling laws for forgetting during finetuning with pretraining data injection, 2025. URL https://arxiv.org/abs/2502.06042

work page arXiv 2025
[5]

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Hen- derson. Fine-tuning aligned language models compromises safety, even when users do not intend to! InInternational Conference on Learning Representations, volume 2024, pages 30988–31043, 2024

2024
[6]

A comprehensive survey of forget- ting in deep learning beyond continual learning, 2024

Zhenyi Wang, Enneng Yang, Li Shen, and Heng Huang. A comprehensive survey of forget- ting in deep learning beyond continual learning, 2024. URLhttps://arxiv.org/abs/2307. 09218. 9

2024
[7]

Learning without Forgetting

Zhizhong Li and Derek Hoiem. Learning without forgetting, 2017. URLhttps://arxiv. org/abs/1606.09282

work page internal anchor Pith review Pith/arXiv arXiv 2017
[8]

Replaying pre-training data improves fine-tuning, 2026

Suhas Kotha and Percy Liang. Replaying pre-training data improves fine-tuning, 2026. URL https://arxiv.org/abs/2603.04964

work page arXiv 2026
[9]

Bagdanov, and Joost van de Weijer

Marc Masana, Xialei Liu, Bartlomiej Twardowski, Mikel Menta, Andrew D. Bagdanov, and Joost van de Weijer. Class-incremental learning: survey and performance evaluation on image classification, 2022. URLhttps://arxiv.org/abs/2010.15277

work page arXiv 2022
[10]

Lillicrap, and Greg Wayne

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne. Expe- rience replay for continual learning, 2019. URLhttps://arxiv.org/abs/1811.11682

work page arXiv 2019
[11]

The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects.Fron- tiers in Psychology, V olume 4 - 2013, 2013

Martial Mermillod, Aurélia Bugaiska, and Patrick BONIN. The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects.Fron- tiers in Psychology, V olume 4 - 2013, 2013. ISSN 1664-1078. doi: 10.3389/fpsyg. 2013.00504. URLhttps://www.frontiersin.org/journals/psychology/articles/10. 3389/fpsyg.2013.00504

work page doi:10.3389/fpsyg 2013
[12]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017
[13]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Guilherme Penedo, Hynek Kydlí ˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URLhttps://arxiv.org/abs/2406.17557

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Nemotron-cc-math: A 133 billion-token-scale high quality math pretraining dataset, 2025

Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mo- hammad Shoeybi, and Bryan Catanzaro. Nemotron-cc-math: A 133 billion-token-scale high quality math pretraining dataset, 2025. URLhttps://arxiv.org/abs/2508.15096

work page arXiv 2025
[15]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URLhttps://arxiv.org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2021
[16]

Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

Robert M French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

1999
[17]

Catastrophic interference in connectionist networks: The sequential learning problem

Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. InPsychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989

1989
[18]

Mitigating catastrophic forgetting in large language models with self- synthesized rehearsal, 2024

Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su. Mitigating catastrophic forgetting in large language models with self- synthesized rehearsal, 2024. URLhttps://arxiv.org/abs/2403.01244

work page arXiv 2024
[19]

Physics of language models: Part 3.3, knowledge capacity scaling laws, 2024

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws, 2024. URLhttps://arxiv.org/abs/2404.05405

work page arXiv 2024
[20]

Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G

John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexan- der M. Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. How much do language models memorize?, 2025. URLhttps://arxiv.org/abs/2505.24832

work page arXiv 2025
[21]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10, 2022. URL https://arxiv.org/abs/2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

Hyperparame- ter transfer enables consistent gains of matrix-preconditioned optimizers across scales

Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, and Andrew Gordon Wilson. Hyperparame- ter transfer enables consistent gains of matrix-preconditioned optimizers across scales. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=Ei6IsmxYrb. 10

2026
[23]

Scaling laws and compute-optimal training beyond fixed training durations

Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben allal, Leandro V on Werra, and Martin Jaggi. Scaling laws and compute-optimal training beyond fixed training durations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=Y13gSfTjGr

2024
[24]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. URLhttps://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URLhttps://arxiv.org/abs/1910.10683

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

A scalable measure of loss landscape curvature for analyzing the training dynamics of llms, 2026

Dayal Singh Kalra, Jean-Christophe Gagnon-Audet, Andrey Gromov, Ishita Mediratta, Kelvin Niu, Alexander H Miller, and Michael Shvartsman. A scalable measure of loss landscape curvature for analyzing the training dynamics of llms, 2026. URLhttps://arxiv.org/abs/ 2601.16979

work page arXiv 2026
[27]

Jeremy Cohen, Alex Damian, Ameet Talwalkar, J Zico Kolter, and Jason D. Lee. Under- standing optimization in deep learning with central flows. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id= sIE2rI3ZPs

2025
[28]

Pre-training llm without learning rate decay enhances supervised fine-tuning, 2026

Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, and Jun Suzuki. Pre-training llm without learning rate decay enhances supervised fine-tuning, 2026. URLhttps://arxiv. org/abs/2603.16127

work page arXiv 2026
[29]

Training dynamics impact post- training quantization robustness, 2026

Albert Catalan-Tatjer, Niccolò Ajroldi, and Jonas Geiping. Training dynamics impact post- training quantization robustness, 2026. URLhttps://arxiv.org/abs/2510.06213

work page arXiv 2026
[30]

A qualitative study of the dynamic behavior for adaptive gradient algorithms, 2021

Chao Ma, Lei Wu, and Weinan E. A qualitative study of the dynamic behavior for adaptive gradient algorithms, 2021. URLhttps://arxiv.org/abs/2009.06125

work page arXiv 2021
[31]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. URLhttps://arxiv.org/ abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Revisiting verilogeval: A year of improvements in large-language models for hardware code generation, 2025

Nathaniel Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, and Brucek Khailany. Revisiting verilogeval: A year of improvements in large-language models for hardware code generation, 2025. URLhttps://arxiv.org/abs/2408.11053

work page arXiv 2025
[33]

Team Olmo, :, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heine- man, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng,...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

Pyranet: A multi-layered hierarchical dataset for verilog

Bardia Nadimi, Ghali Omar Boutaib, and Hao Zheng. Pyranet: A multi-layered hierarchical dataset for verilog. In2025 62nd ACM/IEEE Design Automation Conference (DAC), page 1–7. IEEE, 2025. doi: 10.1109/dac63849.2025.11133406. URLhttp://dx.doi.org/10.1109/ DAC63849.2025.11133406

work page doi:10.1109/dac63849.2025.11133406 2025
[35]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URLhttps: //arxiv.org/abs/2009.03300. 11

work page internal anchor Pith review Pith/arXiv arXiv 2021
[36]

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge, 2019. URLhttps: //arxiv.org/abs/1811.00937

work page internal anchor Pith review Pith/arXiv arXiv 2019
[37]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URLhttps://arxiv.org/abs/1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018
[38]

Memory Aware Synapses: Learning what (not) to forget

Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuyte- laars. Memory aware synapses: Learning what (not) to forget, 2018. URLhttps://arxiv. org/abs/1711.09601

work page internal anchor Pith review Pith/arXiv arXiv 2018
[39]

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. icarl: Incremental classifier and representation learning, 2017. URLhttps://arxiv.org/ abs/1611.07725

work page internal anchor Pith review Pith/arXiv arXiv 2017
[40]

Continual Learning with Deep Generative Replay

Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay, 2017. URLhttps://arxiv.org/abs/1705.08690

work page internal anchor Pith review Pith/arXiv arXiv 2017
[41]

Grow, don’t overwrite: Fine-tuning without forgetting, 2026

Dyah Adila, Hanna Mazzawi, Benoit Dherin, and Xavier Gonzalvo. Grow, don’t overwrite: Fine-tuning without forgetting, 2026. URLhttps://arxiv.org/abs/2603.08647

work page arXiv 2026
[42]

Der: Dynamically expandable representation for class incremental learning, 2021

Shipeng Yan, Jiangwei Xie, and Xuming He. Der: Dynamically expandable representation for class incremental learning, 2021. URLhttps://arxiv.org/abs/2103.16788

work page arXiv 2021
[43]

Foster: Feature boosting and compression for class-incremental learning, 2022

Fu-Yun Wang, Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan. Foster: Feature boosting and compression for class-incremental learning, 2022. URLhttps://arxiv.org/abs/2204. 04662

2022
[44]

Ameya Prabhu, Hasan Abed Al Kader Hammoud, Puneet Dokania, Philip H. S. Torr, Ser-Nam Lim, Bernard Ghanem, and Adel Bibi. Computationally budgeted continual learning: What does matter?, 2023. URLhttps://arxiv.org/abs/2303.11165

work page arXiv 2023
[45]

Forget forgetting: Continual learning in a world of abundant memory, 2026

Dongkyu Cho, Taesup Moon, Rumi Chunara, Kyunghyun Cho, and Sungmin Cha. Forget forgetting: Continual learning in a world of abundant memory, 2026. URLhttps://arxiv. org/abs/2502.07274

work page arXiv 2026
[46]

Loss of plasticity in deep continual learning.Nature, 632 (8026):768–774, 2024

Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning.Nature, 632 (8026):768–774, 2024. URLhttps://www.nature.com/articles/s41586-024-07711-7

2024
[47]

Continual learning should move beyond incremental classification, 2025

Rupert Mitchell, Antonio Alliegro, Raffaello Camoriano, Dustin Carrión-Ojeda, Antonio Carta, Georgia Chalvatzaki, Nikhil Churamani, Carlo D’Eramo, Samin Hamidi, Robin Hesse, Fabian Hinder, Roshni Ramanna Kamath, Vincenzo Lomonaco, Subarnaduti Paul, Francesca Pistilli, Tinne Tuytelaars, Gido M van de Ven, Kristian Kersting, Simone Schaub-Meyer, and Martin ...

work page arXiv 2025
[48]

Continual learning for large language models: A survey.arXiv preprint arXiv:2402.01364, 2024

Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, and Gholamreza Haffari. Continual learning for large language models: A survey, 2024. URLhttps: //arxiv.org/abs/2402.01364

work page arXiv 2024
[49]

Peer pressure: Model-to-model regular- ization for single source domain generalization, 2025

Dong Kyu Cho, Inwoo Hwang, and Sanghack Lee. Peer pressure: Model-to-model regular- ization for single source domain generalization, 2025. URLhttps://arxiv.org/abs/2505. 12745

2025
[50]

Task-agnostic continual learning with hybrid probabilistic models

Polina Kirichenko, Mehrdad Farajtabar, Dushyant Rao, Balaji Lakshminarayanan, Nir Levine, Ang Li, Huiyi Hu, Andrew Gordon Wilson, and Razvan Pascanu. Task-agnostic continual learning with hybrid probabilistic models. InICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021. URLhttps://openreview.net/ forum?id=...

2021
[51]

Self-generated replay memories for continual neural ma- chine translation, 2024

Michele Resta and Davide Bacciu. Self-generated replay memories for continual neural ma- chine translation, 2024. URLhttps://arxiv.org/abs/2403.13130. 12

work page arXiv 2024
[52]

Continual pre-training of language models, 2023

Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models, 2023. URLhttps://arxiv.org/abs/2302.03241

work page arXiv 2023
[53]

Larsen, Jason Chan Lee, Katherine L

Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blak- eney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vi- neeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason ...

work page arXiv 2026
[54]

Buckley, Ja- son Phang, Samuel R

Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Ja- son Phang, Samuel R. Bowman, and Ethan Perez. Pretraining language models with human preferences, 2023. URLhttps://arxiv.org/abs/2302.08582

work page arXiv 2023
[55]

Overtrained language models are harder to fine-tune, 2025

Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, and Aditi Raghunathan. Overtrained language models are harder to fine-tune, 2025. URLhttps://arxiv.org/abs/2503.19206

work page arXiv 2025
[56]

Same pre-training loss, better downstream: Implicit bias matters for language models, 2022

Hong Liu, Sang Michael Xie, Zhiyuan Li, and Tengyu Ma. Same pre-training loss, better downstream: Implicit bias matters for language models, 2022. URLhttps://arxiv.org/ abs/2210.14199

work page arXiv 2022
[57]

Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

Ishaan Watts, Catherine Li, Sachin Goyal, Jacob Mitchell Springer, and Aditi Raghu- nathan. Sharpness-aware pretraining mitigates catastrophic forgetting.arXiv preprint arXiv:2605.02105, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[58]

Knowledge entropy decay during language model pre- training hinders new knowledge acquisition, 2025

Jiyeon Kim, Hyunji Lee, Hyowon Cho, Joel Jang, Hyeonbin Hwang, Seungpil Won, Youbin Ahn, Dohaeng Lee, and Minjoon Seo. Knowledge entropy decay during language model pre- training hinders new knowledge acquisition, 2025. URLhttps://arxiv.org/abs/2410. 01380

2025
[59]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024
[60]

Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

2019
[61]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[62]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URLhttps: //arxiv.org/abs/1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2019
[63]

Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings

Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. InProceedings of the 50th annual international symposium on computer architecture, pages 1–14, 2023

2023
[64]

JAX: composable transformations of Python+NumPy programs, 2018

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URLhttp: //github.com/jax-ml/jax. 13 A Method Details A.1 Self-generated Replay. We generate replay data from th...

2018

[1] [1]

Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort

Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. Continual pre-training of large language models: How to (re)warm your model?, 2023. URLhttps://arxiv.org/abs/2308.04014

work page arXiv 2023

[2] [2]

Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catas- trophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):352...

work page doi:10.1073/pnas.1611835114 2017

[3] [3]

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025. URL https://arxiv.org/abs/2308.08747

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Scaling laws for forgetting during finetuning with pretraining data injection, 2025

Louis Bethune, David Grangier, Dan Busbridge, Eleonora Gualdoni, Marco Cuturi, and Pierre Ablin. Scaling laws for forgetting during finetuning with pretraining data injection, 2025. URL https://arxiv.org/abs/2502.06042

work page arXiv 2025

[5] [5]

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Hen- derson. Fine-tuning aligned language models compromises safety, even when users do not intend to! InInternational Conference on Learning Representations, volume 2024, pages 30988–31043, 2024

2024

[6] [6]

A comprehensive survey of forget- ting in deep learning beyond continual learning, 2024

Zhenyi Wang, Enneng Yang, Li Shen, and Heng Huang. A comprehensive survey of forget- ting in deep learning beyond continual learning, 2024. URLhttps://arxiv.org/abs/2307. 09218. 9

2024

[7] [7]

Learning without Forgetting

Zhizhong Li and Derek Hoiem. Learning without forgetting, 2017. URLhttps://arxiv. org/abs/1606.09282

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

Replaying pre-training data improves fine-tuning, 2026

Suhas Kotha and Percy Liang. Replaying pre-training data improves fine-tuning, 2026. URL https://arxiv.org/abs/2603.04964

work page arXiv 2026

[9] [9]

Bagdanov, and Joost van de Weijer

Marc Masana, Xialei Liu, Bartlomiej Twardowski, Mikel Menta, Andrew D. Bagdanov, and Joost van de Weijer. Class-incremental learning: survey and performance evaluation on image classification, 2022. URLhttps://arxiv.org/abs/2010.15277

work page arXiv 2022

[10] [10]

Lillicrap, and Greg Wayne

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne. Expe- rience replay for continual learning, 2019. URLhttps://arxiv.org/abs/1811.11682

work page arXiv 2019

[11] [11]

The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects.Fron- tiers in Psychology, V olume 4 - 2013, 2013

Martial Mermillod, Aurélia Bugaiska, and Patrick BONIN. The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects.Fron- tiers in Psychology, V olume 4 - 2013, 2013. ISSN 1664-1078. doi: 10.3389/fpsyg. 2013.00504. URLhttps://www.frontiersin.org/journals/psychology/articles/10. 3389/fpsyg.2013.00504

work page doi:10.3389/fpsyg 2013

[12] [12]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017

[13] [13]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Guilherme Penedo, Hynek Kydlí ˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URLhttps://arxiv.org/abs/2406.17557

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Nemotron-cc-math: A 133 billion-token-scale high quality math pretraining dataset, 2025

Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mo- hammad Shoeybi, and Bryan Catanzaro. Nemotron-cc-math: A 133 billion-token-scale high quality math pretraining dataset, 2025. URLhttps://arxiv.org/abs/2508.15096

work page arXiv 2025

[15] [15]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URLhttps://arxiv.org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2021

[16] [16]

Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

Robert M French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3(4):128–135, 1999

1999

[17] [17]

Catastrophic interference in connectionist networks: The sequential learning problem

Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. InPsychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989

1989

[18] [18]

Mitigating catastrophic forgetting in large language models with self- synthesized rehearsal, 2024

Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su. Mitigating catastrophic forgetting in large language models with self- synthesized rehearsal, 2024. URLhttps://arxiv.org/abs/2403.01244

work page arXiv 2024

[19] [19]

Physics of language models: Part 3.3, knowledge capacity scaling laws, 2024

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws, 2024. URLhttps://arxiv.org/abs/2404.05405

work page arXiv 2024

[20] [20]

Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G

John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexan- der M. Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. How much do language models memorize?, 2025. URLhttps://arxiv.org/abs/2505.24832

work page arXiv 2025

[21] [21]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10, 2022. URL https://arxiv.org/abs/2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022

[22] [22]

Hyperparame- ter transfer enables consistent gains of matrix-preconditioned optimizers across scales

Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, and Andrew Gordon Wilson. Hyperparame- ter transfer enables consistent gains of matrix-preconditioned optimizers across scales. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=Ei6IsmxYrb. 10

2026

[23] [23]

Scaling laws and compute-optimal training beyond fixed training durations

Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben allal, Leandro V on Werra, and Martin Jaggi. Scaling laws and compute-optimal training beyond fixed training durations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=Y13gSfTjGr

2024

[24] [24]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. URLhttps://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URLhttps://arxiv.org/abs/1910.10683

work page internal anchor Pith review Pith/arXiv arXiv 2023

[26] [26]

A scalable measure of loss landscape curvature for analyzing the training dynamics of llms, 2026

Dayal Singh Kalra, Jean-Christophe Gagnon-Audet, Andrey Gromov, Ishita Mediratta, Kelvin Niu, Alexander H Miller, and Michael Shvartsman. A scalable measure of loss landscape curvature for analyzing the training dynamics of llms, 2026. URLhttps://arxiv.org/abs/ 2601.16979

work page arXiv 2026

[27] [27]

Jeremy Cohen, Alex Damian, Ameet Talwalkar, J Zico Kolter, and Jason D. Lee. Under- standing optimization in deep learning with central flows. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id= sIE2rI3ZPs

2025

[28] [28]

Pre-training llm without learning rate decay enhances supervised fine-tuning, 2026

Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, and Jun Suzuki. Pre-training llm without learning rate decay enhances supervised fine-tuning, 2026. URLhttps://arxiv. org/abs/2603.16127

work page arXiv 2026

[29] [29]

Training dynamics impact post- training quantization robustness, 2026

Albert Catalan-Tatjer, Niccolò Ajroldi, and Jonas Geiping. Training dynamics impact post- training quantization robustness, 2026. URLhttps://arxiv.org/abs/2510.06213

work page arXiv 2026

[30] [30]

A qualitative study of the dynamic behavior for adaptive gradient algorithms, 2021

Chao Ma, Lei Wu, and Weinan E. A qualitative study of the dynamic behavior for adaptive gradient algorithms, 2021. URLhttps://arxiv.org/abs/2009.06125

work page arXiv 2021

[31] [31]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. URLhttps://arxiv.org/ abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Revisiting verilogeval: A year of improvements in large-language models for hardware code generation, 2025

Nathaniel Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, and Brucek Khailany. Revisiting verilogeval: A year of improvements in large-language models for hardware code generation, 2025. URLhttps://arxiv.org/abs/2408.11053

work page arXiv 2025

[33] [33]

Team Olmo, :, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heine- man, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng,...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[34] [34]

Pyranet: A multi-layered hierarchical dataset for verilog

Bardia Nadimi, Ghali Omar Boutaib, and Hao Zheng. Pyranet: A multi-layered hierarchical dataset for verilog. In2025 62nd ACM/IEEE Design Automation Conference (DAC), page 1–7. IEEE, 2025. doi: 10.1109/dac63849.2025.11133406. URLhttp://dx.doi.org/10.1109/ DAC63849.2025.11133406

work page doi:10.1109/dac63849.2025.11133406 2025

[35] [35]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URLhttps: //arxiv.org/abs/2009.03300. 11

work page internal anchor Pith review Pith/arXiv arXiv 2021

[36] [36]

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge, 2019. URLhttps: //arxiv.org/abs/1811.00937

work page internal anchor Pith review Pith/arXiv arXiv 2019

[37] [37]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URLhttps://arxiv.org/abs/1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018

[38] [38]

Memory Aware Synapses: Learning what (not) to forget

Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuyte- laars. Memory aware synapses: Learning what (not) to forget, 2018. URLhttps://arxiv. org/abs/1711.09601

work page internal anchor Pith review Pith/arXiv arXiv 2018

[39] [39]

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. icarl: Incremental classifier and representation learning, 2017. URLhttps://arxiv.org/ abs/1611.07725

work page internal anchor Pith review Pith/arXiv arXiv 2017

[40] [40]

Continual Learning with Deep Generative Replay

Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay, 2017. URLhttps://arxiv.org/abs/1705.08690

work page internal anchor Pith review Pith/arXiv arXiv 2017

[41] [41]

Grow, don’t overwrite: Fine-tuning without forgetting, 2026

Dyah Adila, Hanna Mazzawi, Benoit Dherin, and Xavier Gonzalvo. Grow, don’t overwrite: Fine-tuning without forgetting, 2026. URLhttps://arxiv.org/abs/2603.08647

work page arXiv 2026

[42] [42]

Der: Dynamically expandable representation for class incremental learning, 2021

Shipeng Yan, Jiangwei Xie, and Xuming He. Der: Dynamically expandable representation for class incremental learning, 2021. URLhttps://arxiv.org/abs/2103.16788

work page arXiv 2021

[43] [43]

Foster: Feature boosting and compression for class-incremental learning, 2022

Fu-Yun Wang, Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan. Foster: Feature boosting and compression for class-incremental learning, 2022. URLhttps://arxiv.org/abs/2204. 04662

2022

[44] [44]

Ameya Prabhu, Hasan Abed Al Kader Hammoud, Puneet Dokania, Philip H. S. Torr, Ser-Nam Lim, Bernard Ghanem, and Adel Bibi. Computationally budgeted continual learning: What does matter?, 2023. URLhttps://arxiv.org/abs/2303.11165

work page arXiv 2023

[45] [45]

Forget forgetting: Continual learning in a world of abundant memory, 2026

Dongkyu Cho, Taesup Moon, Rumi Chunara, Kyunghyun Cho, and Sungmin Cha. Forget forgetting: Continual learning in a world of abundant memory, 2026. URLhttps://arxiv. org/abs/2502.07274

work page arXiv 2026

[46] [46]

Loss of plasticity in deep continual learning.Nature, 632 (8026):768–774, 2024

Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning.Nature, 632 (8026):768–774, 2024. URLhttps://www.nature.com/articles/s41586-024-07711-7

2024

[47] [47]

Continual learning should move beyond incremental classification, 2025

Rupert Mitchell, Antonio Alliegro, Raffaello Camoriano, Dustin Carrión-Ojeda, Antonio Carta, Georgia Chalvatzaki, Nikhil Churamani, Carlo D’Eramo, Samin Hamidi, Robin Hesse, Fabian Hinder, Roshni Ramanna Kamath, Vincenzo Lomonaco, Subarnaduti Paul, Francesca Pistilli, Tinne Tuytelaars, Gido M van de Ven, Kristian Kersting, Simone Schaub-Meyer, and Martin ...

work page arXiv 2025

[48] [48]

Continual learning for large language models: A survey.arXiv preprint arXiv:2402.01364, 2024

Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, and Gholamreza Haffari. Continual learning for large language models: A survey, 2024. URLhttps: //arxiv.org/abs/2402.01364

work page arXiv 2024

[49] [49]

Peer pressure: Model-to-model regular- ization for single source domain generalization, 2025

Dong Kyu Cho, Inwoo Hwang, and Sanghack Lee. Peer pressure: Model-to-model regular- ization for single source domain generalization, 2025. URLhttps://arxiv.org/abs/2505. 12745

2025

[50] [50]

Task-agnostic continual learning with hybrid probabilistic models

Polina Kirichenko, Mehrdad Farajtabar, Dushyant Rao, Balaji Lakshminarayanan, Nir Levine, Ang Li, Huiyi Hu, Andrew Gordon Wilson, and Razvan Pascanu. Task-agnostic continual learning with hybrid probabilistic models. InICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021. URLhttps://openreview.net/ forum?id=...

2021

[51] [51]

Self-generated replay memories for continual neural ma- chine translation, 2024

Michele Resta and Davide Bacciu. Self-generated replay memories for continual neural ma- chine translation, 2024. URLhttps://arxiv.org/abs/2403.13130. 12

work page arXiv 2024

[52] [52]

Continual pre-training of language models, 2023

Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models, 2023. URLhttps://arxiv.org/abs/2302.03241

work page arXiv 2023

[53] [53]

Larsen, Jason Chan Lee, Katherine L

Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blak- eney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vi- neeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason ...

work page arXiv 2026

[54] [54]

Buckley, Ja- son Phang, Samuel R

Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Ja- son Phang, Samuel R. Bowman, and Ethan Perez. Pretraining language models with human preferences, 2023. URLhttps://arxiv.org/abs/2302.08582

work page arXiv 2023

[55] [55]

Overtrained language models are harder to fine-tune, 2025

Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, and Aditi Raghunathan. Overtrained language models are harder to fine-tune, 2025. URLhttps://arxiv.org/abs/2503.19206

work page arXiv 2025

[56] [56]

Same pre-training loss, better downstream: Implicit bias matters for language models, 2022

Hong Liu, Sang Michael Xie, Zhiyuan Li, and Tengyu Ma. Same pre-training loss, better downstream: Implicit bias matters for language models, 2022. URLhttps://arxiv.org/ abs/2210.14199

work page arXiv 2022

[57] [57]

Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

Ishaan Watts, Catherine Li, Sachin Goyal, Jacob Mitchell Springer, and Aditi Raghu- nathan. Sharpness-aware pretraining mitigates catastrophic forgetting.arXiv preprint arXiv:2605.02105, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[58] [58]

Knowledge entropy decay during language model pre- training hinders new knowledge acquisition, 2025

Jiyeon Kim, Hyunji Lee, Hyowon Cho, Joel Jang, Hyeonbin Hwang, Seungpil Won, Youbin Ahn, Dohaeng Lee, and Minjoon Seo. Knowledge entropy decay during language model pre- training hinders new knowledge acquisition, 2025. URLhttps://arxiv.org/abs/2410. 01380

2025

[59] [59]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024

[60] [60]

Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in Neural Information Processing Systems, 32, 2019

2019

[61] [61]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[62] [62]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URLhttps: //arxiv.org/abs/1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2019

[63] [63]

Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings

Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. InProceedings of the 50th annual international symposium on computer architecture, pages 1–14, 2023

2023

[64] [64]

JAX: composable transformations of Python+NumPy programs, 2018

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URLhttp: //github.com/jax-ml/jax. 13 A Method Details A.1 Self-generated Replay. We generate replay data from th...

2018