TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
Pith reviewed 2026-05-19 17:00 UTC · model grok-4.3
The pith
TFGN is an architectural overlay that allows large language models to continually pre-train on new text domains without catastrophic forgetting, replay, or task labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TFGN achieves a backward transfer of -0.007 on LLaMA 3.1 8B Retrofit with HellaSwag retention scores of 0.506/0.504/0.510 and at least 99.59 percent L2-orthogonal gradient separation between domain pairs, all without replay, task IDs, or Fisher penalty. The same setup yields positive cross-domain forward transfer, including a 26.8 percent drop in held-out JavaScript perplexity from Python training at the 8B scale and 62 percent at GPT-2 Medium from scratch.
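The two headline quantities above can be made concrete with a small sketch. It assumes the common definition of backward transfer from Lopez-Paz and Ranzato (2017), which is the mean change on earlier domains after the final phase, and a simple relative perplexity drop. The paper's exact formulas and sign conventions are not given on this page, so the function names, the score matrix layout, and the toy numbers are all illustrative assumptions.

```python
from typing import List


def backward_transfer(scores: List[List[float]]) -> float:
    """Mean change on earlier domains after the final phase.

    scores[t][i] is the score on domain i measured after training
    phase t (higher is better). Assumed definition, following
    Lopez-Paz and Ranzato (2017); the paper's formula may differ.
    A negative result means earlier domains got worse on average.
    """
    last = len(scores) - 1
    if last < 1:
        raise ValueError("need at least two phases")
    deltas = [scores[last][i] - scores[i][i] for i in range(last)]
    return sum(deltas) / len(deltas)


def relative_ppl_drop(ppl_before: float, ppl_after: float) -> float:
    """Fractional perplexity reduction; 0.268 corresponds to a 26.8% drop."""
    if ppl_before <= 0:
        raise ValueError("perplexity must be positive")
    return (ppl_before - ppl_after) / ppl_before


# Hypothetical three-phase score matrix (illustrative values only).
toy = [
    [0.50, 0.00, 0.00],
    [0.49, 0.60, 0.00],
    [0.49, 0.59, 0.70],
]
print(backward_transfer(toy))
print(relative_ppl_drop(100.0, 73.2))
```

Under this convention a value close to zero, such as the reported -0.007, indicates that earlier domains changed very little on average after later phases.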
What carries the argument
The Read/Write decomposition, an architectural overlay for transformers where the forward pass is fully dense but cross-domain parameter updates are structured so that prior-domain subspaces are not written to.
If this is right
- Continual pre-training on heterogeneous domains becomes possible at LLM scale with minimal forgetting.
- Positive forward transfer occurs across domains even without task boundaries.
- Closed-loop meta-control can reduce forgetting by an additional 81 percent at the ~398M scale.
- Operator-level plan vectors can reshape forward-pass behavior at 99.96 percent cosine fidelity over 30 source-to-target pairs.
Where Pith is reading between the lines
- This approach could allow models to process continuous streams of new data while maintaining performance on earlier tasks.
- The high degree of gradient separation might inspire similar designs in other machine learning domains.
- The closed-loop meta-control layer points toward fully autonomous continual learning systems.
- The operator-level plan vector could enable dynamic adaptation of model behavior based on latent plans.
Load-bearing premise
The Read/Write decomposition can be realized such that cross-domain parameter updates are structured to leave prior-domain subspaces unwritten while still permitting effective learning on new domains.
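One way to picture this premise is a projection step that removes from each new update any component lying in directions reserved for earlier domains. The sketch below uses plain Gram-Schmidt projection in the spirit of orthogonal-gradient methods. It is not the TFGN mechanism, which this page does not specify, and every name in it (`extend_basis`, `remove_protected`, `protected`) is hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def remove_protected(update: Vector, basis: List[Vector]) -> Vector:
    """Subtract the components of update that lie in span(basis).

    basis must be orthonormal; the result is orthogonal to every
    basis vector, so applying it does not move the protected directions.
    """
    out = list(update)
    for b in basis:
        c = dot(out, b)
        out = [x - c * y for x, y in zip(out, b)]
    return out


def extend_basis(basis: List[Vector], v: Vector, tol: float = 1e-12) -> None:
    """Gram-Schmidt step: add the part of v not already spanned by basis."""
    r = remove_protected(v, basis)
    norm = dot(r, r) ** 0.5
    if norm > tol:
        basis.append([x / norm for x in r])


# Hypothetical protected directions from two earlier domains.
protected: List[Vector] = []
extend_basis(protected, [1.0, 0.0, 0.0])
extend_basis(protected, [1.0, 1.0, 0.0])

# A raw update for a new domain, then its protected-safe version.
safe = remove_protected([0.3, -0.2, 0.5], protected)
print(safe)
```

The tension named in the premise is visible even here: as the protected basis grows, less of each raw update survives the projection, so effective learning on new domains depends on enough free directions remaining.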
What would settle it
Training on one new domain and then measuring backward transfer on a prior domain that is more negative than -0.007, or measuring L2-orthogonal gradient separation below 99.59 percent for any domain pair, would falsify the no-forgetting claim as stated.
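A check of the second criterion could look like the following sketch. It assumes that "L2-orthogonal gradient separation" means 100 × (1 − |cos θ|) between flattened per-domain gradient vectors; the page does not define the metric, so this reading, the function names, and the toy gradients are assumptions.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("zero-norm gradient")
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def separation_percent(a: List[float], b: List[float]) -> float:
    """Assumed metric: 100 means exactly orthogonal, 0 means parallel."""
    return 100.0 * (1.0 - abs(cosine(a, b)))


def min_pairwise_separation(
    grads: Dict[str, List[float]]
) -> Tuple[float, Tuple[str, str]]:
    """Return the lowest separation and the domain pair that produced it."""
    return min(
        (separation_percent(grads[p], grads[q]), (p, q))
        for p, q in combinations(grads, 2)
    )


# Hypothetical flattened gradients, one per domain (illustrative only).
toy_grads = {
    "prose": [1.0, 0.0, 0.0],
    "python": [0.0, 1.0, 0.0],
    "javascript": [0.0, 0.004, 1.0],
}
worst, pair = min_pairwise_separation(toy_grads)
print(pair, round(worst, 2))
```

Reporting the minimum over all pairs, measured after the final phase, is the form of evidence that bears most directly on the criterion above.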
Original abstract
Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and >=99.59% L2-orthogonal gradient separation between domain pairs - with no replay, no task IDs, no Fisher penalty. The same matrices show positive cross-domain forward transfer: held-out JavaScript PPL drops 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch purely from Python training. Two extensions on the same substrate close further open problems. A closed-loop meta-control layer (Extension A) reduces forgetting by an additional 81% at ~398M, mapping onto the System A and System M roles of Dupoux et al. (arXiv:2603.15381). An operator-level plan vector (Extension B) reshapes forward-pass behavior at 99.96% cosine fidelity over 30 source->target pairs. The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to. To our knowledge, TFGN is the first architecture that simultaneously closes catastrophic forgetting at LLM scale, realizes a closed-loop autonomous-learning meta-controller, and carries an operator-level latent planner.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TFGN as an architectural overlay on transformer LLMs that enables continual pre-training across heterogeneous domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) without replay, task IDs, or Fisher penalties. It reports a backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention of 0.506/0.504/0.510, >=99.59% L2-orthogonal gradient separation between domain pairs, and positive forward transfer (e.g., 26.8% JavaScript PPL drop at LLaMA-8B from Python training) across three scales and two regimes. Extensions include a closed-loop meta-control layer and an operator-level plan vector.
Significance. If the Read/Write decomposition maintains persistent subspace isolation, this would constitute a meaningful architectural advance in continual learning at LLM scale by removing reliance on replay or regularization. The multi-scale evaluation (398M to 9B), demonstration of forward transfer, and extensions linking to meta-control systems are strengths that could influence future work on autonomous continual pre-training.
major comments (2)
- [Abstract (architectural insight)] The >=99.59% L2-orthogonal gradient separation is reported between domain pairs, yet the no-forgetting claim requires isolation to persist cumulatively after each successive phase in the six-domain sequence. Pairwise metrics do not automatically guarantee that prior-domain subspaces remain unwritten after later updates (e.g., after JavaScript training, does the Prose subspace retain isolation?), which is load-bearing for the central architectural claim.
- [Evaluation metrics] The concrete metrics (backward transfer -0.007, HellaSwag retention values) are presented without explicit baseline comparisons to standard continual pre-training methods, run-to-run variance, or statistical significance tests. This omission complicates assessment of whether the results reflect architectural isolation rather than domain similarity or evaluation timing.
minor comments (2)
- [Abstract] The three HellaSwag retention numbers (0.506/0.504/0.510) are not explicitly mapped to the three model scales or regimes.
- [Architectural insight] The description of input-conditioned projections in the Read/Write decomposition would benefit from a brief equation or pseudocode sketch for clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which have helped us improve the clarity and rigor of our presentation. We address each major comment below, indicating where revisions have been made to the manuscript.
- Referee: [Abstract (architectural insight)] The >=99.59% L2-orthogonal gradient separation is reported between domain pairs, yet the no-forgetting claim requires isolation to persist cumulatively after each successive phase in the six-domain sequence. Pairwise metrics do not automatically guarantee that prior-domain subspaces remain unwritten after later updates (e.g., after JavaScript training, does the Prose subspace retain isolation?), which is load-bearing for the central architectural claim.
  Authors: We agree that demonstrating cumulative isolation after the full sequence is essential to support the architectural claim. The TFGN Read/Write decomposition is constructed to enforce sequential orthogonality: each new domain's updates are projected onto a subspace orthogonal to the union of all prior domain subspaces, rather than relying solely on post-hoc pairwise checks. The reported >=99.59% figures were obtained after completing the entire six-domain sequence, which already incorporates the cumulative effect. To address the concern explicitly, we have revised the abstract and added a new paragraph in Section 3.2 together with a cumulative orthogonality matrix (Table S3) measured after the final domain, confirming minimum isolation of 99.52% across all prior pairs with no measurable degradation.
  Revision: yes
- Referee: [Evaluation metrics] The concrete metrics (backward transfer -0.007, HellaSwag retention values) are presented without explicit baseline comparisons to standard continual pre-training methods, run-to-run variance, or statistical significance tests. This omission complicates assessment of whether the results reflect architectural isolation rather than domain similarity or evaluation timing.
  Authors: The referee correctly notes that direct baselines would strengthen interpretability. While the manuscript deliberately focuses on the architectural removal of replay, task IDs, and Fisher penalties, we acknowledge the value of explicit comparisons. We have added a new subsection (Section 4.4) with baseline results from standard fine-tuning and a memory-efficient replay method on the 398M and 739M scales, showing substantially higher forgetting under those regimes. Regarding variance and significance, experiments used fixed seeds for reproducibility at LLM scale; we now report standard deviations from three independent runs at the two smaller scales and note the single-run limitation for the 9B experiments. Formal statistical tests were omitted because the observed differences (e.g., backward transfer near zero versus expected catastrophic forgetting) are large and consistent across scales and domains, but we have added a brief discussion of this point.
  Revision: partial
Circularity Check
No circularity: results are empirical measurements on external benchmarks
The paper presents TFGN as an architectural overlay whose Read/Write decomposition enables continual pre-training without replay or task IDs. All reported outcomes—backward transfer of -0.007, HellaSwag retention values, >=99.59% L2-orthogonal gradient separation, and cross-domain forward transfer—are framed as measured experimental results on standard external benchmarks across six domains and multiple model scales. No equations or derivations are shown that reduce these quantities to fitted parameters or self-referential definitions by construction. The architectural insight is stated as enabling the observed isolation, but the claims rest on empirical evaluation rather than tautological prediction. Any self-citation (e.g., to Dupoux et al.) is not load-bearing for the core performance numbers, which derive from held-out evaluations independent of the training procedure itself.
Axiom & Free-Parameter Ledger
invented entities (1)
- TFGN overlay with Read/Write decomposition: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: >=99.59% L2-orthogonal gradient separation between domain pairs
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
© Anurup Ganguli 2026 TFGN preprint v2