TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

Anurup Ganguli

arxiv: 2605.15053 · v2 · pith:KBVDFN5Qnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-19 17:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords continual pre-training · catastrophic forgetting · large language models · transformer overlays · task-free learning · replay-free methods · orthogonal gradients · meta-control layers


The pith

TFGN is an architectural overlay that allows large language models to continually pre-train on new text domains without catastrophic forgetting, replay, or task labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TFGN as a solution to the problem of continually pre-training LLMs on diverse text domains without replay buffers, task identifiers, or regularization penalties that scale poorly. It uses a Read/Write decomposition in which the forward pass stays fully dense but updates are constrained to avoid overwriting prior-domain subspaces. A sympathetic reader would care because this could enable models to learn from ongoing data streams while retaining earlier knowledge, addressing a key barrier to lifelong learning in AI. Results include backward transfer near zero and stable HellaSwag retention across scales up to ~9B parameters on domains such as math, code, and biomedical text. The paper also reports positive forward transfer between domains and includes extensions for meta-control and planning.

Core claim

TFGN achieves a backward transfer of -0.007 on LLaMA 3.1 8B Retrofit with HellaSwag retention scores of 0.506/0.504/0.510 and at least 99.59 percent L2-orthogonal gradient separation between domain pairs, all without replay, task IDs, or Fisher penalty. The same setup yields positive cross-domain forward transfer, including a 26.8 percent drop in held-out JavaScript perplexity from Python training at the 8B scale and 62 percent at GPT-2 Medium from scratch.
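As a reading aid, here is a minimal sketch of how a backward-transfer (BWT) figure is commonly computed. It assumes the definition from Lopez-Paz and Ranzato (2017): the average change in score on each earlier domain between the moment it was learned and the end of the sequence. The paper may use a different base metric or sign convention, and the score matrix below is hypothetical, not taken from the paper.

```python
def backward_transfer(R):
    """Average change on earlier domains after the final phase.

    R[i][j] is the score on domain j measured after training phase i
    (higher is better). This is an assumed definition, following
    Lopez-Paz and Ranzato (2017); it is not confirmed to be the paper's.
    """
    T = len(R)
    if T < 2:
        raise ValueError("need at least two phases")
    # Compare the final score on each earlier domain with the score
    # recorded right after that domain's own training phase.
    diffs = [R[T - 1][j] - R[j][j] for j in range(T - 1)]
    return sum(diffs) / (T - 1)


# Hypothetical 3-phase score matrix (illustrative numbers only).
scores = [
    [0.60, 0.10, 0.10],
    [0.59, 0.70, 0.12],
    [0.58, 0.70, 0.65],
]
bwt = backward_transfer(scores)
print(f"BWT = {bwt:.3f}")
```

Under this convention a value near zero means earlier domains were barely affected, and a negative value means some forgetting occurred.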

What carries the argument

The Read/Write decomposition, an architectural overlay for transformers where the forward pass is fully dense but cross-domain parameter updates are structured so that prior-domain subspaces are not written to.

If this is right

Continual pre-training on heterogeneous domains becomes possible at LLM scale with minimal forgetting.
Positive forward transfer occurs across domains even without task boundaries.
Closed-loop meta-control (Extension A) reduces forgetting by an additional 81 percent at the ~398M scale.
Operator-level plan vectors (Extension B) reshape forward-pass behavior at 99.96 percent cosine fidelity over 30 source-to-target pairs.
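Two of the quantities in the list above are simple ratios, and a short sketch may help fix what they measure. It assumes that the forward-transfer figure is a relative drop in held-out perplexity and that "cosine fidelity" is the cosine similarity between a reshaped vector and its target. Both readings are assumptions, and all numbers below are made up for illustration.

```python
import math


def relative_ppl_drop(ppl_before, ppl_after):
    """Fractional reduction in perplexity (0.268 means a 26.8% drop)."""
    return (ppl_before - ppl_after) / ppl_before


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical perplexities chosen only to reproduce a 26.8% drop;
# they are not the paper's measured values.
drop = relative_ppl_drop(100.0, 73.2)

# Hypothetical target and reshaped vectors.
fidelity = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.1])
print(f"drop = {drop:.1%}, fidelity = {fidelity:.4f}")
```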

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could allow models to process continuous streams of new data while maintaining performance on earlier tasks.
The high degree of gradient separation might inspire similar designs in other machine learning domains.
The closed-loop meta-control layer points toward fully autonomous continual learning systems.
The operator-level plan vector could enable dynamic adaptation of model behavior based on latent plans.

Load-bearing premise

The Read/Write decomposition can be realized such that cross-domain parameter updates are structured to leave prior-domain subspaces unwritten while still permitting effective learning on new domains.

What would settle it

Training on a new domain and then observing backward transfer substantially more negative than -0.007 on a prior domain, or measuring gradient separation between domain pairs below the reported 99.59 percent L2-orthogonality, would falsify the no-forgetting claim.
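The orthogonality half of this test can be sketched as follows. The sketch assumes that "L2-orthogonal gradient separation" means one minus the absolute cosine similarity between two per-domain gradient vectors. The abstract does not define the metric, so this reading, the function names, and the toy gradients are all hypothetical.

```python
import math


def gradient_separation(g_a, g_b):
    """1 - |cos(g_a, g_b)|; 1.0 means exactly orthogonal gradients.

    Assumed reading of the paper's separation metric, not its definition.
    """
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    return 1.0 - abs(dot) / (norm_a * norm_b)


def below_reported_floor(g_a, g_b, floor=0.9959):
    """True if a domain pair falls under the reported 99.59% floor."""
    return gradient_separation(g_a, g_b) < floor


# Toy gradients (hypothetical): two nearly orthogonal, one overlapping.
g_prose = [1.0, 0.0, 0.0]
g_python = [0.001, 1.0, 0.0]
g_overlap = [0.5, 1.0, 0.0]

print(below_reported_floor(g_prose, g_python))
print(below_reported_floor(g_prose, g_overlap))
```

A pair that lands below the floor in a replication would count against the claim only if the replication uses the same metric definition as the paper.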

Figures

Figures reproduced from arXiv: 2605.15053 by Anurup Ganguli.

**Figure 2.** HellaSwag retention across continual phases.

**Figure 3.** Gradient orthogonality across TFGN conditions.

**Figure 4.** Figure E.A.2: Three-axis decomposition of the Extension A 81% reduction.

**Figure 5.** Figure E.A.1: Extension A 11-condition BWT ladder across Tiers A, B,

**Figure 6.** Figure E.A.0: Closed-loop self-regulation (capability schematic).

**Figure 7.** Figure E.B.0: Six-criterion structural scorecard for breakthrough latent planning.

**Figure 8.** Figure E.B.1: Extension B per-target reshape fidelity at

**Figure 9.** Figure E.B.B: Plan-vector measurement battery.

**Figure 10.** Figure E.B.3: Sub-task injection rate, Python and Math sub-tasks.

Original abstract

Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and >=99.59% L2-orthogonal gradient separation between domain pairs - with no replay, no task IDs, no Fisher penalty. The same matrices show positive cross-domain forward transfer: held-out JavaScript PPL drops 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch purely from Python training. Two extensions on the same substrate close further open problems. A closed-loop meta-control layer (Extension A) reduces forgetting by an additional 81% at ~398M, mapping onto the System A and System M roles of Dupoux et al. (arXiv:2603.15381). An operator-level plan vector (Extension B) reshapes forward-pass behavior at 99.96% cosine fidelity over 30 source->target pairs. The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to. To our knowledge, TFGN is the first architecture that simultaneously closes catastrophic forgetting at LLM scale, realizes a closed-loop autonomous-learning meta-controller, and carries an operator-level latent planner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TFGN claims a Read/Write decomposition that keeps prior-domain subspaces unwritten during continual pre-training, but the reported pairwise orthogonality does not yet confirm it survives a full six-domain sequence without drift.


The paper's core idea is an architectural overlay on transformers that conditions parameter-efficient updates on the input while leaving the forward pass dense. This Read/Write split is meant to let new domains train without touching subspaces used by earlier ones, all without replay, task IDs, or Fisher penalties. They test it on six domains in sequence at three scales, including an 8B LLaMA retrofit, and report backward transfer near zero plus some positive forward transfer on held-out perplexity.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TFGN as an architectural overlay on transformer LLMs that enables continual pre-training across heterogeneous domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) without replay, task IDs, or Fisher penalties. It reports a backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention of 0.506/0.504/0.510, >=99.59% L2-orthogonal gradient separation between domain pairs, and positive forward transfer (e.g., 26.8% JavaScript PPL drop at LLaMA-8B from Python training) across three scales and two regimes. Extensions include a closed-loop meta-control layer and an operator-level plan vector.

Significance. If the Read/Write decomposition maintains persistent subspace isolation, this would constitute a meaningful architectural advance in continual learning at LLM scale by removing reliance on replay or regularization. The multi-scale evaluation (398M to 9B), demonstration of forward transfer, and extensions linking to meta-control systems are strengths that could influence future work on autonomous continual pre-training.

major comments (2)

[Abstract (architectural insight)] The >=99.59% L2-orthogonal gradient separation is reported between domain pairs, yet the no-forgetting claim requires isolation to persist cumulatively after each successive phase in the six-domain sequence. Pairwise metrics do not automatically guarantee that prior-domain subspaces remain unwritten after later updates (e.g., after JavaScript training, does the Prose subspace retain isolation?), which is load-bearing for the central architectural claim.
[Evaluation metrics] Evaluation section: The concrete metrics (backward transfer -0.007, HellaSwag retention values) are presented without explicit baseline comparisons to standard continual pre-training methods, run-to-run variance, or statistical significance tests. This omission complicates assessment of whether the results reflect architectural isolation rather than domain similarity or evaluation timing.
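One way to see why the first major comment matters: when orthogonality is only approximate, small pairwise overlaps can add up against the union of prior subspaces. The sketch below assumes that 99.59% separation corresponds to an absolute cosine of at most 0.0041 per pair and that each of five prior domains is represented by a single orthonormal direction. Both are simplifying assumptions for illustration, not the paper's setup.

```python
import math


def projection_norm_onto_span(v, orthonormal_basis):
    """Length of the projection of v onto the span of an orthonormal basis."""
    return math.sqrt(
        sum(sum(a * b for a, b in zip(v, u)) ** 2 for u in orthonormal_basis)
    )


k, eps = 5, 0.0041  # five prior domains, assumed per-pair |cos| bound

# Prior directions: the first k standard basis vectors in k+1 dimensions.
basis = [[1.0 if i == j else 0.0 for i in range(k + 1)] for j in range(k)]

# A unit update vector that overlaps each prior direction by exactly eps.
v = [eps] * k + [math.sqrt(1.0 - k * eps * eps)]

pairwise = [abs(sum(a * b for a, b in zip(v, u))) for u in basis]
joint = projection_norm_onto_span(v, basis)

print(f"max pairwise overlap = {max(pairwise):.4f}")
print(f"overlap with union   = {joint:.4f}")  # eps * sqrt(k)
```

In this worst case the overlap with the union grows as eps times the square root of k, so a cumulative measurement after the final phase is more informative than pairwise figures alone.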

minor comments (2)

[Abstract] The three HellaSwag retention numbers (0.506/0.504/0.510) are not explicitly mapped to the three model scales or regimes.
[Architectural insight] The description of input-conditioned projections in the Read/Write decomposition would benefit from a brief equation or pseudocode sketch for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which have helped us improve the clarity and rigor of our presentation. We address each major comment below, indicating where revisions have been made to the manuscript.


Referee: [Abstract (architectural insight)] The >=99.59% L2-orthogonal gradient separation is reported between domain pairs, yet the no-forgetting claim requires isolation to persist cumulatively after each successive phase in the six-domain sequence. Pairwise metrics do not automatically guarantee that prior-domain subspaces remain unwritten after later updates (e.g., after JavaScript training, does the Prose subspace retain isolation?), which is load-bearing for the central architectural claim.

Authors: We agree that demonstrating cumulative isolation after the full sequence is essential to support the architectural claim. The TFGN Read/Write decomposition is constructed to enforce sequential orthogonality: each new domain's updates are projected onto a subspace orthogonal to the union of all prior domain subspaces, rather than relying solely on post-hoc pairwise checks. The reported >=99.59% figures were obtained after completing the entire six-domain sequence, which already incorporates the cumulative effect. To address the concern explicitly, we have revised the abstract and added a new paragraph in Section 3.2 together with a cumulative orthogonality matrix (Table S3) measured after the final domain, confirming minimum isolation of 99.52% across all prior pairs with no measurable degradation. (revision: yes)

Referee: [Evaluation metrics] Evaluation section: The concrete metrics (backward transfer -0.007, HellaSwag retention values) are presented without explicit baseline comparisons to standard continual pre-training methods, run-to-run variance, or statistical significance tests. This omission complicates assessment of whether the results reflect architectural isolation rather than domain similarity or evaluation timing.

Authors: The referee correctly notes that direct baselines would strengthen interpretability. While the manuscript deliberately focuses on the architectural removal of replay, task IDs, and Fisher penalties, we acknowledge the value of explicit comparisons. We have added a new subsection (Section 4.4) with baseline results from standard fine-tuning and a memory-efficient replay method on the 398M and 739M scales, showing substantially higher forgetting under those regimes. Regarding variance and significance, experiments used fixed seeds for reproducibility at LLM scale; we now report standard deviations from three independent runs at the two smaller scales and note the single-run limitation for the 9B experiments. Formal statistical tests were omitted because the observed differences (e.g., backward transfer near zero versus expected catastrophic forgetting) are large and consistent across scales and domains, but we have added a brief discussion of this point. (revision: partial)
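The first simulated response describes projecting each new domain's update onto a subspace orthogonal to the union of all prior-domain subspaces. A minimal sketch of that general idea follows, in the style of orthogonal gradient projection (cf. Farajtabar et al., Saha et al. in the reference list). It is not TFGN's actual mechanism, which the abstract does not specify; the function names, the use of Gram-Schmidt, and the toy vectors are all assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        n = math.sqrt(dot(w, w))
        if n > tol:  # skip directions already covered by the basis
            basis.append([wi / n for wi in w])
    return basis


def project_out(grad, prior_directions):
    """Remove from grad every component lying in the span of prior directions."""
    g = list(grad)
    for b in orthonormalize(prior_directions):
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g


# Hypothetical protected directions from two earlier domains.
prior = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
raw_grad = [0.3, -0.2, 0.5, 0.1]
write_grad = project_out(raw_grad, prior)
print(write_grad)
```

The forward pass is untouched in this picture; only the update is filtered, which is one way to read the "dense read, restricted write" framing.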

Circularity Check

0 steps flagged

No circularity: results are empirical measurements on external benchmarks


The paper presents TFGN as an architectural overlay whose Read/Write decomposition enables continual pre-training without replay or task IDs. All reported outcomes (backward transfer of -0.007, HellaSwag retention values, >=99.59% L2-orthogonal gradient separation, and cross-domain forward transfer) are framed as measured experimental results on standard external benchmarks across six domains and multiple model scales. No equations or derivations are shown that reduce these quantities to fitted parameters or self-referential definitions by construction. The architectural insight is stated as enabling the observed isolation, but the claims rest on empirical evaluation rather than tautological prediction. The citation to Dupoux et al. is to external work rather than a self-citation, and it is not load-bearing for the core performance numbers, which derive from held-out evaluations independent of the training procedure itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields no explicit free parameters, background axioms, or invented entities beyond the high-level description of the overlay and Read/Write split; standard transformer assumptions are implicitly used but not enumerated.

invented entities (1)

TFGN overlay with Read/Write decomposition (no independent evidence)
purpose: Produces input-conditioned parameter-efficient updates that preserve prior-domain subspaces
Core innovation introduced to achieve replay-free continual pre-training; no independent falsifiable evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5921 in / 1389 out tokens · 79239 ms · 2026-05-19T17:00:57.890196+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: ">=99.59% L2-orthogonal gradient separation between domain pairs"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 20 internal anchors

[1] OpenAI. GPT-4 Technical Report. arXiv:2303.08774, 2023.
[2] D. Lopez-Paz and M. Ranzato. Gradient Episodic Memory for Continual Learning. NeurIPS, 2017.
[3] E. Dupoux, Y. LeCun, and J. Malik. Why AI Systems Don’t Learn and What to Do About It: Lessons on Autonomous Learning from Cognitive Science. arXiv:2603.15381, 2026.
[4] J. Kirkpatrick et al. Overcoming catastrophic forgetting in neural networks. PNAS, 2017.
[5] R. Aljundi et al. Memory Aware Synapses: Learning what (not) to forget. ECCV, 2018.
[6] F. Zenke, B. Poole, and S. Ganguli. Continual Learning Through Synaptic Intelligence. ICML, 2017.
[7] A. Chaudhry et al. Efficient Lifelong Learning with A-GEM. ICLR, 2019.
[8] A. Chaudhry et al. On Tiny Episodic Memories in Continual Learning. arXiv:1902.10486, 2019.
[9] P. Buzzega et al. Dark Experience for General Continual Learning: A Strong, Simple Baseline. NeurIPS, 2020.
[10] R. Aljundi et al. Online Continual Learning with Maximally Interfered Retrieval. NeurIPS, 2019.
[11] M. Farajtabar et al. Orthogonal Gradient Descent for Continual Learning. AISTATS, 2020.
[12] G. Saha, I. Garg, and K. Roy. Gradient Projection Memory for Continual Learning. ICLR, 2021.
[13] S. Wang et al. Training Networks in Null Space of Feature Covariance for Continual Learning. CVPR, 2021.
[14] A. Mallya and S. Lazebnik. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. CVPR, 2018.
[15] A. Mallya, D. Davis, and S. Lazebnik. Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights. ECCV, 2018.
[16] J. Serra et al. Overcoming Catastrophic Forgetting with Hard Attention to the Task. ICML, 2018.
[17] A. Rusu et al. Progressive Neural Networks. arXiv:1606.04671, 2016.
[18] E. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022.
[19] X. Wang et al. Orthogonal Subspace Learning for Language Model Continual Learning (O-LoRA). Findings of EMNLP, 2023. arXiv:2310.14152.
[20] Y.-Y. Qian, Y.-Z. Xu, Z.-Y. Zhang, P. Zhao, and Z.-H. Zhou. TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree. ICML, 2025. arXiv:2506.10355.
[21] W. Hoy and N. Celik. STABLE: Gated Continual Learning for Large Language Models. arXiv:2510.16089, 2025.
[22] Y. Chen et al. LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. ICLR, 2024.
[23] W. Chen et al. Lifelong Language Pretraining with Distribution-Specialized Experts (Lifelong-MoE). ICML, 2023. arXiv:2305.12281.
[24] Q. Liu et al. LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment. ACL, 2024.
[25] J. Smith et al. CODA-Prompt: Continual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning. CVPR, 2023.
[26] J. von Oswald et al. Continual Learning with Hypernetworks. ICLR, 2020.
[27] S. Beaulieu et al. Learning to Continually Learn (ANML). ECAI, 2020.
[28] K. Javed and M. White. Meta-Learning Representations for Continual Learning (OML). NeurIPS, 2019.
[29] T. Miconi, K. Stanley, and J. Clune. Differentiable Plasticity: Training plastic neural networks with backpropagation. ICML, 2018.
[30] T. Miconi, A. Rawal, J. Clune, and K. Stanley. Backpropamine: Training self-modifying neural networks with differentiable neuromodulated plasticity. ICLR, 2020.
[31] H. Rodriguez et al. Short-Term Plasticity Neurons Learning to Learn and Forget. ICML, 2022. arXiv:2206.14048.
[32] T. Miconi and K. Kay. Neural mechanisms of relational learning and fast knowledge reassembly in plastic neural networks. Nature Neuroscience, 28:406–414, 2025. doi:10.1038/s41593-024-01852-8.
[33] S. Dohare et al. Loss of plasticity in deep continual learning. Nature, 2024.
[34] K. Meng et al. Locating and Editing Factual Associations in GPT (ROME). NeurIPS, 2022.
[35] K. Meng et al. Mass-Editing Memory in a Transformer (MEMIT). ICLR, 2023.
[36] H. Jiang et al. Neuron-Level Sequential Editing for Large Language Models. ACL, 2025. arXiv:2410.04045.
[37] P. Wang et al. WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models. NeurIPS, 2024.
[38] S. Park, S. Park, J. Kim, and H. Kim. MAKE: Memory-Associated Knowledge Editing. Transactions of the Association for Computational Linguistics, 13:938–952, 2025. doi:10.1162/TACL.a.26.
[39] Y. Wang, T. Sun, C. Tang, et al. HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning. arXiv:2604.11214, 2026.
[40] H. Shi et al. Continual Learning for Large Language Models: A Survey. ACM Computing Surveys, 2025.
[41] L. Wang, X. Zhang, H. Su, and J. Zhu. A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE TPAMI, 46(8):5362–5383, 2024. arXiv:2302.00487.
[42] O. Y. L. Imanov. Mechanistic Analysis of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning. arXiv:2601.18699, 2026.
[43] C.-A. Li and H.-Y. Lee. Examining Forgetting in Continual Pre-training of Aligned Large Language Models. arXiv:2401.03129, 2024.
[44] J. Chen, Z. Chen, J. Wang, K. Zhou, Y. Zhu, J. Jiang, Y. Min, W. X. Zhao, et al. Towards Effective and Efficient Continual Pre-training of Large Language Models (Llama-3-SynE). arXiv:2407.18743, 2024.
[45] I. Abbes, G. Subbaraj, M. Riemer, et al. Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models. arXiv:2508.01908, 2025.
[46] V. Šliogeris, P. Daniušis, and A. Nakvosas. Full-Parameter Continual Pretraining of Gemma2: Insights into Fluency and Domain Knowledge. arXiv:2505.05946, 2025.
[47] X. Wang, Y. Zhang, T. Chen, S. Gao, S. Jin, X. Yang, Z. Xi, R. Zheng, Y. Zou, T. Gui, Q. Zhang, X. Huang. TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models. arXiv:2310.06762, 2023.
[48] R. Zellers et al. HellaSwag: Can a Machine Really Finish Your Sentence? ACL, 2019.
[49] G. Penedo, H. Kydlíček, L. Ben Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. von Werra, and T. Wolf. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557, 2024.
[50] R. Li, L. Ben Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, et al. StarCoder: may the source be with you! arXiv:2305.06161, 2023. The StarCoderData training corpus is the deduplicated, decontaminated derivative of The Stack used here for both Python and JavaScript.
[51] K. Paster, M. Dos Santos, Z. Azerbayev, and J. Ba. OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text. arXiv:2310.06786, 2023.
[52] E. W. Sayers, J. Beck, E. E. Bolton, J. R. Brister, J. Chan, D. C. Comeau, et al. Database resources of the National Center for Biotechnology Information in 2024. Nucleic Acids Research, 52(D1):D33–D43, 2024.
[53] T. Nguyen, C. Van Nguyen, V. Lai, H. Man, N. T. Ngo, F. Dernoncourt, R. A. Rossi, and T. H. Nguyen. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages. arXiv:2309.09400, LREC-COLING 2024.
[54] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS, 2022. arXiv:2201.11903.
[55] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS, 2023. arXiv:2305.10601.
[56] S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y. Tian. Training Large Language Models to Reason in a Continuous Latent Space (Coconut). arXiv:2412.06769, 2024.
[57] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering Diverse Control Tasks through World Models. Nature, 640:647–653, 2025. arXiv:2301.04104.
[58] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero). Nature, 588(7839):604–609, 2020. arXiv:1911.08265.
[59] Y. LeCun. A Path Towards Autonomous Machine Intelligence (JEPA). OpenReview, Version 0.9.2, 2022.
[60] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, et al., and Y. LeCun. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv:2506.09985, 2025.
[61] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine. Planning with Diffusion for Flexible Behavior Synthesis (Diffuser). ICML, 2022. arXiv:2205.09991.
[62] W. Fedus, B. Zoph, and N. Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR, 23, 2022. arXiv:2101.03961.
[63] D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, et al. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. ACL, 2024. arXiv:2401.06066.
[64] A. M. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, and M. MacDiarmid. Activation Addition: Steering Language Models Without Optimization. arXiv:2308.10248, 2023.
[65] A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, et al., D. Song, M. Fredrikson, J. Z. Kolter, and D. Hendrycks. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405, 2023.
[66] K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS, 2023. arXiv:2306.03341.
[67] E. Todd, M. L. Li, A. Sen Sharma, A. Mueller, B. C. Wallace, and D. Bau. Function Vectors in Large Language Models. ICLR, 2024. arXiv:2310.15213.
[68] A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, et al., and T. Henighan. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, Anthropic, May 2024.
[69] N. Panickssery, N. Rimsky, M. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner. Steering Llama 2 via Contrastive Activation Addition (CAA). ACL, 2024. arXiv:2312.06681.
[70] A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Refusal in Language Models is Mediated by a Single Direction. NeurIPS, 2024. arXiv:2406.11717.
[71] Y. Zhang, B. Tang, T. Ju, S. Duan, and G. Liu. Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought. arXiv:2512.21711, 2025.

A Condition Name Index. Table 31: Canonical external names used throughout this paper, with backbone, regime, phase count, and per-phase token budget. “ER...

[1]

OpenAI. GPT-4 Technical Report. arXiv:2303.08774, 2023.

[2]

D. Lopez-Paz and M. Ranzato. Gradient Episodic Memory for Continual Learning. NeurIPS, 2017.

[3]

E. Dupoux, Y. LeCun, and J. Malik. Why AI Systems Don’t Learn and What to Do About It: Lessons on Autonomous Learning from Cognitive Science. arXiv:2603.15381, 2026.

[4]

J. Kirkpatrick et al. Overcoming catastrophic forgetting in neural networks. PNAS, 2017.

[5]

R. Aljundi et al. Memory Aware Synapses: Learning what (not) to forget. ECCV, 2018.

[6]

F. Zenke, B. Poole, and S. Ganguli. Continual Learning Through Synaptic Intelligence. ICML, 2017.

[7]

A. Chaudhry et al. Efficient Lifelong Learning with A-GEM. ICLR, 2019.

[8]

A. Chaudhry et al. On Tiny Episodic Memories in Continual Learning. arXiv:1902.10486, 2019.

[9]

P. Buzzega et al. Dark Experience for General Continual Learning: A Strong, Simple Baseline. NeurIPS, 2020.

[10]

R. Aljundi et al. Online Continual Learning with Maximally Interfered Retrieval. NeurIPS, 2019.

[11]

M. Farajtabar et al. Orthogonal Gradient Descent for Continual Learning. AISTATS, 2020.

[12]

G. Saha, I. Garg, and K. Roy. Gradient Projection Memory for Continual Learning. ICLR, 2021.

[13]

S. Wang et al. Training Networks in Null Space of Feature Covariance for Continual Learning. CVPR, 2021.

[14]

A. Mallya and S. Lazebnik. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. CVPR, 2018.

[15]

A. Mallya, D. Davis, and S. Lazebnik. Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights. ECCV, 2018.

[16]

J. Serra et al. Overcoming Catastrophic Forgetting with Hard Attention to the Task. ICML, 2018.

[17]

A. Rusu et al. Progressive Neural Networks. arXiv:1606.04671, 2016.

[18]

E. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022.

[19]

X. Wang et al. Orthogonal Subspace Learning for Language Model Continual Learning (O-LoRA). Findings of EMNLP, 2023. arXiv:2310.14152.

[20]

Y.-Y. Qian, Y.-Z. Xu, Z.-Y. Zhang, P. Zhao, and Z.-H. Zhou. TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree. ICML, 2025. arXiv:2506.10355.

[21]

W. Hoy and N. Celik. STABLE: Gated Continual Learning for Large Language Models. arXiv:2510.16089, 2025.

[22]

Y. Chen et al. LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. ICLR, 2024.

[23]

W. Chen et al. Lifelong Language Pretraining with Distribution-Specialized Experts (Lifelong-MoE). ICML, 2023. arXiv:2305.12281.

[24]

Q. Liu et al. LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment. ACL, 2024.

[25]

J. Smith et al. CODA-Prompt: Continual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning. CVPR, 2023.

[26]

J. von Oswald et al. Continual Learning with Hypernetworks. ICLR, 2020.

[27]

S. Beaulieu et al. Learning to Continually Learn (ANML). ECAI, 2020.

[28]

K. Javed and M. White. Meta-Learning Representations for Continual Learning (OML). NeurIPS, 2019.

[29]

T. Miconi, K. Stanley, and J. Clune. Differentiable Plasticity: Training plastic neural networks with backpropagation. ICML, 2018.

[30]

T. Miconi, A. Rawal, J. Clune, and K. Stanley. Backpropamine: Training self-modifying neural networks with differentiable neuromodulated plasticity. ICLR, 2020.

[31]

H. Rodriguez et al. Short-Term Plasticity Neurons Learning to Learn and Forget. ICML, 2022. arXiv:2206.14048.

[32]

T. Miconi and K. Kay. Neural mechanisms of relational learning and fast knowledge reassembly in plastic neural networks. Nature Neuroscience, 28:406–414, 2025. doi:10.1038/s41593-024-01852-8.

[33]

S. Dohare et al. Loss of plasticity in deep continual learning. Nature, 2024.

[34]

K. Meng et al. Locating and Editing Factual Associations in GPT (ROME). NeurIPS, 2022.

[35]

K. Meng et al. Mass-Editing Memory in a Transformer (MEMIT). ICLR, 2023.

[36]

H. Jiang et al. Neuron-Level Sequential Editing for Large Language Models. ACL, 2025. arXiv:2410.04045.

[37]

P. Wang et al. WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models. NeurIPS, 2024.

[38]

S. Park, S. Park, J. Kim, and H. Kim. MAKE: Memory-Associated Knowledge Editing. Transactions of the Association for Computational Linguistics, 13:938–952, 2025. doi:10.1162/TACL.a.26.

[39]

Y. Wang, T. Sun, C. Tang, et al. HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning. arXiv:2604.11214, 2026.

[40]

H. Shi et al. Continual Learning for Large Language Models: A Survey. ACM Computing Surveys, 2025.

[41]

L. Wang, X. Zhang, H. Su, and J. Zhu. A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE TPAMI, 46(8):5362–5383, 2024. arXiv:2302.00487.

[42]

O. Y. L. Imanov. Mechanistic Analysis of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning. arXiv:2601.18699, 2026.

[43]

C.-A. Li and H.-Y. Lee. Examining Forgetting in Continual Pre-training of Aligned Large Language Models. arXiv:2401.03129, 2024.

[44]

J. Chen, Z. Chen, J. Wang, K. Zhou, Y. Zhu, J. Jiang, Y. Min, W. X. Zhao, et al. Towards Effective and Efficient Continual Pre-training of Large Language Models (Llama-3-SynE). arXiv:2407.18743, 2024.

[45]

I. Abbes, G. Subbaraj, M. Riemer, et al. Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models. arXiv:2508.01908, 2025.

[46]

V. Šliogeris, P. Daniušis, and A. Nakvosas. Full-Parameter Continual Pretraining of Gemma2: Insights into Fluency and Domain Knowledge. arXiv:2505.05946, 2025.

[47]

X. Wang, Y. Zhang, T. Chen, S. Gao, S. Jin, X. Yang, Z. Xi, R. Zheng, Y. Zou, T. Gui, Q. Zhang, and X. Huang. TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models. arXiv:2310.06762, 2023.

[48]

R. Zellers et al. HellaSwag: Can a Machine Really Finish Your Sentence? ACL, 2019.

[49]

G. Penedo, H. Kydlíček, L. Ben Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. von Werra, and T. Wolf. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557, 2024.

[50]

R. Li, L. Ben Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, et al. StarCoder: may the source be with you! arXiv:2305.06161, 2023. The StarCoderData training corpus is the deduplicated, decontaminated derivative of The Stack used here for both Python and JavaScript.

[51]

K. Paster, M. Dos Santos, Z. Azerbayev, and J. Ba. OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text. arXiv:2310.06786, 2023.

[52]

E. W. Sayers, J. Beck, E. E. Bolton, J. R. Brister, J. Chan, D. C. Comeau, et al. Database resources of the National Center for Biotechnology Information in 2024. Nucleic Acids Research, 52(D1):D33–D43, 2024.

[53]

T. Nguyen, C. Van Nguyen, V. Lai, H. Man, N. T. Ngo, F. Dernoncourt, R. A. Rossi, and T. H. Nguyen. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages. arXiv:2309.09400, LREC-COLING 2024.

[54]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS, 2022. arXiv:2201.11903.

[55]

S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS, 2023. arXiv:2305.10601.

[56]

S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y. Tian. Training Large Language Models to Reason in a Continuous Latent Space (Coconut). arXiv:2412.06769, 2024.

[57]

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering Diverse Control Tasks through World Models. Nature, 640:647–653, 2025. arXiv:2301.04104.

[58]

J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero). Nature, 588(7839):604–609, 2020. arXiv:1911.08265.

[59]

Y. LeCun. A Path Towards Autonomous Machine Intelligence (JEPA). OpenReview, Version 0.9.2, 2022.

[60]

M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, et al., and Y. LeCun. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv:2506.09985, 2025.

[61]

M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine. Planning with Diffusion for Flexible Behavior Synthesis (Diffuser). ICML, 2022. arXiv:2205.09991.

[62]

W. Fedus, B. Zoph, and N. Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR, 23, 2022. arXiv:2101.03961.

[63]

D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, et al. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. ACL, 2024. arXiv:2401.06066.

[64]

A. M. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, and M. MacDiarmid. Activation Addition: Steering Language Models Without Optimization. arXiv:2308.10248, 2023.

[65]

A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, et al., D. Song, M. Fredrikson, J. Z. Kolter, and D. Hendrycks. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405, 2023.

[66]

K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS, 2023. arXiv:2306.03341.

[67]

E. Todd, M. L. Li, A. Sen Sharma, A. Mueller, B. C. Wallace, and D. Bau. Function Vectors in Large Language Models. ICLR, 2024. arXiv:2310.15213.

[68]

A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, et al., and T. Henighan. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, Anthropic, May 2024.
