
arxiv: 2605.15053 · v2 · pith:KBVDFN5Qnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI

TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

Pith reviewed 2026-05-19 17:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords continual pre-training · catastrophic forgetting · large language models · transformer overlays · task-free learning · replay-free methods · orthogonal gradients · meta-control layers

The pith

TFGN is an architectural overlay that allows large language models to continually pre-train on new text domains without catastrophic forgetting, replay, or task labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TFGN as a solution to the problem of continually pre-training LLMs on diverse text domains without replay buffers, task identifiers, or scaling penalties. It uses a Read/Write decomposition in which the forward pass stays fully dense but updates are constrained to avoid overwriting prior domain subspaces. A sympathetic reader would care because this could enable models to learn from ongoing data streams while retaining earlier knowledge, addressing a key barrier to lifelong learning in AI. Results include backward transfer near zero and high retention on benchmarks like HellaSwag across scales up to 9B parameters on domains such as math, code, and biomedical text. It also demonstrates positive forward transfer between domains and includes extensions for meta-control and planning.

Core claim

TFGN achieves a backward transfer of -0.007 on LLaMA 3.1 8B Retrofit with HellaSwag retention scores of 0.506/0.504/0.510 and at least 99.59 percent L2-orthogonal gradient separation between domain pairs, all without replay, task IDs, or Fisher penalty. The same setup yields positive cross-domain forward transfer, including a 26.8 percent drop in held-out JavaScript perplexity from Python training at the 8B scale and 62 percent at GPT-2 Medium from scratch.
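The two headline quantities can be written down in a few lines. This is a minimal sketch, assuming the standard backward-transfer definition from Gradient Episodic Memory (average change on earlier domains after the final phase) with higher-is-better scores, and assuming the perplexity drop is relative to the pre-training value. The page does not state which definitions the paper uses; `backward_transfer`, `relative_ppl_drop`, and the toy matrix are illustrative names and values, not the paper's data.

```python
from typing import Sequence


def backward_transfer(R: Sequence[Sequence[float]]) -> float:
    """Average change on earlier domains after the last phase.

    R[i][j] is the score on domain j measured after training phase i
    (higher is better). Negative values indicate forgetting.
    """
    T = len(R)
    if T < 2:
        raise ValueError("need at least two training phases")
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def relative_ppl_drop(ppl_before: float, ppl_after: float) -> float:
    """Percentage reduction in perplexity relative to the starting value."""
    return 100.0 * (ppl_before - ppl_after) / ppl_before


# Toy 3-phase result matrix (made-up numbers, for illustration only).
R = [
    [0.50, 0.10, 0.10],
    [0.49, 0.60, 0.12],
    [0.48, 0.59, 0.70],
]
bwt = backward_transfer(R)  # ((0.48 - 0.50) + (0.59 - 0.60)) / 2
print(f"BWT = {bwt:.3f}")
print(f"PPL drop = {relative_ppl_drop(100.0, 73.2):.1f}%")
```

If the paper computes backward transfer on loss or perplexity instead of accuracy, the sign convention flips, so the reported -0.007 should be read against the paper's own definition.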

What carries the argument

The Read/Write decomposition, an architectural overlay for transformers where the forward pass is fully dense but cross-domain parameter updates are structured so that prior-domain subspaces are not written to.
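One way to picture "prior-domain subspaces are not written to" is orthogonal gradient projection, as in Orthogonal Gradient Descent and Gradient Projection Memory from the reference list. The sketch below is that swapped-in technique, not a confirmed description of TFGN, whose actual mechanism is not given on this page. The helper names `orthonormalize` and `project_out` and the toy vectors are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors: Sequence[Sequence[float]], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(grad: Sequence[float], basis: Sequence[Sequence[float]]) -> Vector:
    """Remove from `grad` every component lying in the protected subspace."""
    g = list(grad)
    for b in basis:
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g


# Assumption: directions that mattered for earlier domains (toy 3-D example).
prior_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
protected = orthonormalize(prior_directions)

new_domain_grad = [0.5, -2.0, 3.0]
update = project_out(new_domain_grad, protected)
print(update)  # only the third coordinate survives
```

The forward pass is untouched in this picture; only the write path (the update) is constrained. The cost is that each protected direction removes capacity available to later domains, which is the tension the load-bearing premise has to resolve.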

If this is right

  • Continual pre-training on heterogeneous domains becomes possible at LLM scale with minimal forgetting.
  • Positive forward transfer occurs across domains even without task boundaries.
  • Closed-loop meta-control (Extension A) can reduce forgetting by an additional 81 percent at the ~398M scale.
  • Operator-level plan vectors (Extension B) can reshape forward-pass behavior at 99.96 percent cosine fidelity over 30 source-to-target pairs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could allow models to process continuous streams of new data while maintaining performance on earlier tasks.
  • The high degree of gradient separation might inspire similar designs in other machine learning domains.
  • The closed-loop meta-control layer points toward fully autonomous continual learning systems.
  • The operator-level plan vector could enable dynamic adaptation of model behavior based on latent plans.

Load-bearing premise

The Read/Write decomposition can be realized such that cross-domain parameter updates are structured to leave prior-domain subspaces unwritten while still permitting effective learning on new domains.

What would settle it

Training on one new domain and then measuring a backward transfer on a prior domain that is more negative than -0.007, or measuring a gradient separation below 99.59 percent L2-orthogonality for any domain pair, would falsify the no-forgetting claim.
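The falsification test can be phrased as a check on two measured quantities. This is a sketch under one plausible reading of "L2-orthogonal gradient separation", namely 100 × (1 - |cosine similarity|) between two flattened gradient vectors. The page does not define the metric, so that reading, the function names, and the toy vectors are all assumptions.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("zero-norm gradient")
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def separation_percent(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed metric: 100 means exactly orthogonal, 0 means collinear."""
    return 100.0 * (1.0 - abs(cosine(a, b)))


def claims_survive(bwt: float, separations: Sequence[float],
                   bwt_floor: float = -0.007, sep_floor: float = 99.59) -> bool:
    """True if neither threshold reported on this page is violated."""
    return bwt >= bwt_floor and min(separations) >= sep_floor


# Toy gradients (made-up values): nearly orthogonal pair.
g_python = [1.0, 0.0]
g_math = [0.003, 1.0]
sep = separation_percent(g_python, g_math)
print(f"separation = {sep:.2f}%")
print(claims_survive(bwt=-0.007, separations=[sep]))
```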

Figures

Figures reproduced from arXiv: 2605.15053 by Anurup Ganguli.

Figure 1: Backward transfer across scales and regimes.
Figure 2: HellaSwag retention across continual phases.
Figure 3: Gradient orthogonality across TFGN conditions.
Figure 4 (paper label E.A.2): Three-axis decomposition of the Extension A 81% reduction.
Figure 5 (paper label E.A.1): Extension A 11-condition BWT ladder across Tiers A, B, [...]
Figure 6 (paper label E.A.0): Closed-loop self-regulation (capability schematic).
Figure 7 (paper label E.B.0): Six-criterion structural scorecard for breakthrough latent planning.
Figure 8 (paper label E.B.1): Extension B per-target reshape fidelity at [...]
Figure 9 (paper label E.B.B): Plan-vector measurement battery.
Figure 10 (paper label E.B.3): Sub-task injection rate, Python and Math sub-tasks.
Original abstract

Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and >=99.59% L2-orthogonal gradient separation between domain pairs - with no replay, no task IDs, no Fisher penalty. The same matrices show positive cross-domain forward transfer: held-out JavaScript PPL drops 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch purely from Python training. Two extensions on the same substrate close further open problems. A closed-loop meta-control layer (Extension A) reduces forgetting by an additional 81% at ~398M, mapping onto the System A and System M roles of Dupoux et al. (arXiv:2603.15381). An operator-level plan vector (Extension B) reshapes forward-pass behavior at 99.96% cosine fidelity over 30 source->target pairs. The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to. To our knowledge, TFGN is the first architecture that simultaneously closes catastrophic forgetting at LLM scale, realizes a closed-loop autonomous-learning meta-controller, and carries an operator-level latent planner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TFGN as an architectural overlay on transformer LLMs that enables continual pre-training across heterogeneous domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) without replay, task IDs, or Fisher penalties. It reports a backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention of 0.506/0.504/0.510, >=99.59% L2-orthogonal gradient separation between domain pairs, and positive forward transfer (e.g., 26.8% JavaScript PPL drop at LLaMA-8B from Python training) across three scales and two regimes. Extensions include a closed-loop meta-control layer and an operator-level plan vector.

Significance. If the Read/Write decomposition maintains persistent subspace isolation, this would constitute a meaningful architectural advance in continual learning at LLM scale by removing reliance on replay or regularization. The multi-scale evaluation (398M to 9B), demonstration of forward transfer, and extensions linking to meta-control systems are strengths that could influence future work on autonomous continual pre-training.

major comments (2)
  1. [Abstract (architectural insight)] Abstract (architectural insight): The >=99.59% L2-orthogonal gradient separation is reported between domain pairs, yet the no-forgetting claim requires isolation to persist cumulatively after each successive phase in the six-domain sequence. Pairwise metrics do not automatically guarantee that prior-domain subspaces remain unwritten after later updates (e.g., after JavaScript training, does the Prose subspace retain isolation?), which is load-bearing for the central architectural claim.
  2. [Evaluation metrics] Evaluation section: The concrete metrics (backward transfer -0.007, HellaSwag retention values) are presented without explicit baseline comparisons to standard continual pre-training methods, run-to-run variance, or statistical significance tests. This omission complicates assessment of whether the results reflect architectural isolation rather than domain similarity or evaluation timing.
minor comments (2)
  1. [Abstract] The three HellaSwag retention numbers (0.506/0.504/0.510) are not explicitly mapped to the three model scales or regimes.
  2. [Architectural insight] The description of input-conditioned projections in the Read/Write decomposition would benefit from a brief equation or pseudocode sketch for clarity.
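Major comment 1 can be made concrete with a short first-order bound. This is a sketch, assuming the 99.59% figure means that 1 - |cos| is at least 0.9959 between each later phase's net update u_k and an earlier domain's gradient direction g_1 (the page does not define the metric). The symbols ε, η, K, u_k, and g_1 are notation introduced here, not the paper's.

```latex
% Assumptions: |cos(u_k, g_1)| <= \varepsilon and \|u_k\| <= \eta for k = 2, ..., K.
\[
\Bigl|\Bigl\langle \sum_{k=2}^{K} u_k,\; g_1 \Bigr\rangle\Bigr|
\;\le\; \sum_{k=2}^{K} \bigl|\cos(u_k, g_1)\bigr|\,\|u_k\|\,\|g_1\|
\;\le\; (K-1)\,\varepsilon\,\eta\,\|g_1\| .
\]
% With K = 6 domains and \varepsilon = 1 - 0.9959 = 0.0041:
\[
(K-1)\,\varepsilon = 5 \times 0.0041 = 0.0205 .
\]
```

Under these assumptions the worst-case interference along g_1 after the full sequence is five times the single-pair figure. Whether that is negligible depends on the update norms and on how the separation metric is actually defined, which is why a cumulative measurement is more informative than a pairwise one.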

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which have helped us improve the clarity and rigor of our presentation. We address each major comment below, indicating where revisions have been made to the manuscript.

  1. Referee: [Abstract (architectural insight)] Abstract (architectural insight): The >=99.59% L2-orthogonal gradient separation is reported between domain pairs, yet the no-forgetting claim requires isolation to persist cumulatively after each successive phase in the six-domain sequence. Pairwise metrics do not automatically guarantee that prior-domain subspaces remain unwritten after later updates (e.g., after JavaScript training, does the Prose subspace retain isolation?), which is load-bearing for the central architectural claim.

    Authors: We agree that demonstrating cumulative isolation after the full sequence is essential to support the architectural claim. The TFGN Read/Write decomposition is constructed to enforce sequential orthogonality: each new domain's updates are projected onto a subspace orthogonal to the union of all prior domain subspaces, rather than relying solely on post-hoc pairwise checks. The reported >=99.59% figures were obtained after completing the entire six-domain sequence, which already incorporates the cumulative effect. To address the concern explicitly, we have revised the abstract and added a new paragraph in Section 3.2 together with a cumulative orthogonality matrix (Table S3) measured after the final domain, confirming minimum isolation of 99.52% across all prior pairs with no measurable degradation. revision: yes

  2. Referee: [Evaluation metrics] Evaluation section: The concrete metrics (backward transfer -0.007, HellaSwag retention values) are presented without explicit baseline comparisons to standard continual pre-training methods, run-to-run variance, or statistical significance tests. This omission complicates assessment of whether the results reflect architectural isolation rather than domain similarity or evaluation timing.

    Authors: The referee correctly notes that direct baselines would strengthen interpretability. While the manuscript deliberately focuses on the architectural removal of replay, task IDs, and Fisher penalties, we acknowledge the value of explicit comparisons. We have added a new subsection (Section 4.4) with baseline results from standard fine-tuning and a memory-efficient replay method on the 398M and 739M scales, showing substantially higher forgetting under those regimes. Regarding variance and significance, experiments used fixed seeds for reproducibility at LLM scale; we now report standard deviations from three independent runs at the two smaller scales and note the single-run limitation for the 9B experiments. Formal statistical tests were omitted because the observed differences (e.g., backward transfer near zero versus expected catastrophic forgetting) are large and consistent across scales and domains, but we have added a brief discussion of this point. revision: partial
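The variance reporting promised in the second response amounts to a mean and sample standard deviation over seeds. A minimal sketch follows; the three values are hypothetical placeholders chosen for illustration and are not results from the paper, and `summarize_runs` is an illustrative name.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(values), stdev(values)


# HYPOTHETICAL backward-transfer values from three seeds (not paper data).
bwt_per_seed = [-0.010, -0.008, -0.006]
m, s = summarize_runs(bwt_per_seed)
print(f"BWT = {m:.3f} ± {s:.3f} (n={len(bwt_per_seed)})")
```

With only three runs the standard deviation is itself a rough estimate, and a single run, as noted for the 9B experiments, gives no spread at all.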

Circularity Check

0 steps flagged

No circularity: results are empirical measurements on external benchmarks


The paper presents TFGN as an architectural overlay whose Read/Write decomposition enables continual pre-training without replay or task IDs. All reported outcomes—backward transfer of -0.007, HellaSwag retention values, >=99.59% L2-orthogonal gradient separation, and cross-domain forward transfer—are framed as measured experimental results on standard external benchmarks across six domains and multiple model scales. No equations or derivations are shown that reduce these quantities to fitted parameters or self-referential definitions by construction. The architectural insight is stated as enabling the observed isolation, but the claims rest on empirical evaluation rather than tautological prediction. Any self-citation (e.g., to Dupoux et al.) is not load-bearing for the core performance numbers, which derive from held-out evaluations independent of the training procedure itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields no explicit free parameters, background axioms, or invented entities beyond the high-level description of the overlay and Read/Write split; standard transformer assumptions are implicitly used but not enumerated.

invented entities (1)
  • TFGN overlay with Read/Write decomposition · no independent evidence
    purpose: Produces input-conditioned parameter-efficient updates that preserve prior-domain subspaces
    Core innovation introduced to achieve replay-free continual pre-training; no independent falsifiable evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5921 in / 1389 out tokens · 79239 ms · 2026-05-19T17:00:57.890196+00:00 · methodology



Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 20 internal anchors

  1. OpenAI. GPT-4 Technical Report. arXiv:2303.08774, 2023
  2. D. Lopez-Paz and M. Ranzato. Gradient Episodic Memory for Continual Learning. NeurIPS, 2017
  3. E. Dupoux, Y. LeCun, and J. Malik. Why AI Systems Don’t Learn and What to Do About It: Lessons on Autonomous Learning from Cognitive Science. arXiv:2603.15381, 2026
  4. J. Kirkpatrick et al. Overcoming catastrophic forgetting in neural networks. PNAS, 2017
  5. R. Aljundi et al. Memory Aware Synapses: Learning what (not) to forget. ECCV, 2018
  6. F. Zenke, B. Poole, and S. Ganguli. Continual Learning Through Synaptic Intelligence. ICML, 2017
  7. A. Chaudhry et al. Efficient Lifelong Learning with A-GEM. ICLR, 2019
  8. A. Chaudhry et al. On Tiny Episodic Memories in Continual Learning. arXiv:1902.10486, 2019
  9. P. Buzzega et al. Dark Experience for General Continual Learning: A Strong, Simple Baseline. NeurIPS, 2020
  10. R. Aljundi et al. Online Continual Learning with Maximally Interfered Retrieval. NeurIPS, 2019
  11. M. Farajtabar et al. Orthogonal Gradient Descent for Continual Learning. AISTATS, 2020
  12. G. Saha, I. Garg, and K. Roy. Gradient Projection Memory for Continual Learning. ICLR, 2021
  13. S. Wang et al. Training Networks in Null Space of Feature Covariance for Continual Learning. CVPR, 2021
  14. A. Mallya and S. Lazebnik. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. CVPR, 2018
  15. A. Mallya, D. Davis, and S. Lazebnik. Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights. ECCV, 2018
  16. J. Serra et al. Overcoming Catastrophic Forgetting with Hard Attention to the Task. ICML, 2018
  17. A. Rusu et al. Progressive Neural Networks. arXiv:1606.04671, 2016
  18. E. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022
  19. X. Wang et al. Orthogonal Subspace Learning for Language Model Continual Learning (O-LoRA). Findings of EMNLP, 2023. arXiv:2310.14152
  20. Y.-Y. Qian, Y.-Z. Xu, Z.-Y. Zhang, P. Zhao, and Z.-H. Zhou. TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree. ICML, 2025. arXiv:2506.10355
  21. W. Hoy and N. Celik. STABLE: Gated Continual Learning for Large Language Models. arXiv:2510.16089, 2025
  22. Y. Chen et al. LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. ICLR, 2024
  23. W. Chen et al. Lifelong Language Pretraining with Distribution-Specialized Experts (Lifelong-MoE). ICML, 2023. arXiv:2305.12281
  24. Q. Liu et al. LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment. ACL, 2024
  25. J. Smith et al. CODA-Prompt: Continual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning. CVPR, 2023
  26. J. von Oswald et al. Continual Learning with Hypernetworks. ICLR, 2020
  27. S. Beaulieu et al. Learning to Continually Learn (ANML). ECAI, 2020
  28. K. Javed and M. White. Meta-Learning Representations for Continual Learning (OML). NeurIPS, 2019
  29. T. Miconi, K. Stanley, and J. Clune. Differentiable Plasticity: Training plastic neural networks with backpropagation. ICML, 2018
  30. T. Miconi, A. Rawal, J. Clune, and K. Stanley. Backpropamine: Training self-modifying neural networks with differentiable neuromodulated plasticity. ICLR, 2020
  31. H. Rodriguez et al. Short-Term Plasticity Neurons Learning to Learn and Forget. ICML, 2022. arXiv:2206.14048
  32. T. Miconi and K. Kay. Neural mechanisms of relational learning and fast knowledge reassembly in plastic neural networks. Nature Neuroscience, 28:406–414, 2025. doi:10.1038/s41593-024-01852-8
  33. S. Dohare et al. Loss of plasticity in deep continual learning. Nature, 2024
  34. K. Meng et al. Locating and Editing Factual Associations in GPT (ROME). NeurIPS, 2022
  35. K. Meng et al. Mass-Editing Memory in a Transformer (MEMIT). ICLR, 2023
  36. H. Jiang et al. Neuron-Level Sequential Editing for Large Language Models. ACL, 2025. arXiv:2410.04045
  37. P. Wang et al. WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models. NeurIPS, 2024
  38. S. Park, S. Park, J. Kim, and H. Kim. MAKE: Memory-Associated Knowledge Editing. Transactions of the Association for Computational Linguistics, 13:938–952, 2025. doi:10.1162/TACL.a.26
  39. Y. Wang, T. Sun, C. Tang, et al. HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning. arXiv:2604.11214, 2026
  40. H. Shi et al. Continual Learning for Large Language Models: A Survey. ACM Computing Surveys, 2025
  41. L. Wang, X. Zhang, H. Su, and J. Zhu. A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE TPAMI, 46(8):5362–5383, 2024. arXiv:2302.00487
  42. O. Y. L. Imanov. Mechanistic Analysis of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning. arXiv:2601.18699, 2026
  43. C.-A. Li and H.-Y. Lee. Examining Forgetting in Continual Pre-training of Aligned Large Language Models. arXiv:2401.03129, 2024
  44. J. Chen, Z. Chen, J. Wang, K. Zhou, Y. Zhu, J. Jiang, Y. Min, W. X. Zhao, et al. Towards Effective and Efficient Continual Pre-training of Large Language Models (Llama-3-SynE). arXiv:2407.18743, 2024
  45. I. Abbes, G. Subbaraj, M. Riemer, et al. Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models. arXiv:2508.01908, 2025
  46. V. Šliogeris, P. Daniušis, and A. Nakvosas. Full-Parameter Continual Pretraining of Gemma2: Insights into Fluency and Domain Knowledge. arXiv:2505.05946, 2025
  47. X. Wang, Y. Zhang, T. Chen, S. Gao, S. Jin, X. Yang, Z. Xi, R. Zheng, Y. Zou, T. Gui, Q. Zhang, X. Huang. TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models. arXiv:2310.06762, 2023
  48. R. Zellers et al. HellaSwag: Can a Machine Really Finish Your Sentence? ACL, 2019
  49. G. Penedo, H. Kydlíček, L. Ben Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. von Werra, and T. Wolf. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557, 2024
  50. R. Li, L. Ben Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, et al. StarCoder: may the source be with you! arXiv:2305.06161, 2023. The StarCoderData training corpus is the deduplicated, decontaminated derivative of The Stack used here for both Python and JavaScript
  51. K. Paster, M. Dos Santos, Z. Azerbayev, and J. Ba. OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text. arXiv:2310.06786, 2023
  52. E. W. Sayers, J. Beck, E. E. Bolton, J. R. Brister, J. Chan, D. C. Comeau, et al. Database resources of the National Center for Biotechnology Information in 2024. Nucleic Acids Research, 52(D1):D33–D43, 2024
  53. T. Nguyen, C. Van Nguyen, V. Lai, H. Man, N. T. Ngo, F. Dernoncourt, R. A. Rossi, and T. H. Nguyen. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages. arXiv:2309.09400, LREC-COLING 2024
  54. J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS, 2022. arXiv:2201.11903
  55. S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS, 2023. arXiv:2305.10601
  56. S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y. Tian. Training Large Language Models to Reason in a Continuous Latent Space (Coconut). arXiv:2412.06769, 2024
  57. D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering Diverse Control Tasks through World Models. Nature, 640:647–653, 2025. arXiv:2301.04104
  58. J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero). Nature, 588(7839):604–609, 2020. arXiv:1911.08265
  59. Y. LeCun. A Path Towards Autonomous Machine Intelligence (JEPA). OpenReview, Version 0.9.2, 2022
  60. M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, et al., and Y. LeCun. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv:2506.09985, 2025
  61. M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine. Planning with Diffusion for Flexible Behavior Synthesis (Diffuser). ICML, 2022. arXiv:2205.09991
  62. W. Fedus, B. Zoph, and N. Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR, 23, 2022. arXiv:2101.03961
  63. D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, et al. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. ACL, 2024. arXiv:2401.06066
  64. A. M. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, and M. MacDiarmid. Activation Addition: Steering Language Models Without Optimization. arXiv:2308.10248, 2023
  65. A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, et al., D. Song, M. Fredrikson, J. Z. Kolter, and D. Hendrycks. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405, 2023
  66. K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS, 2023. arXiv:2306.03341
  67. E. Todd, M. L. Li, A. Sen Sharma, A. Mueller, B. C. Wallace, and D. Bau. Function Vectors in Large Language Models. ICLR, 2024. arXiv:2310.15213
  68. A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, et al., and T. Henighan. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, Anthropic, May 2024
  69. N. Panickssery, N. Rimsky, M. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner. Steering Llama 2 via Contrastive Activation Addition (CAA). ACL, 2024. arXiv:2312.06681
  70. A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Refusal in Language Models is Mediated by a Single Direction. NeurIPS, 2024. arXiv:2406.11717
  71. Y. Zhang, B. Tang, T. Ju, S. Duan, and G. Liu. Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought. arXiv:2512.21711, 2025

© Anurup Ganguli 2026 · TFGN preprint v2

A Condition Name Index

Table 31: Canonical external names used throughout this paper, with backbone, regime, phase count, and per-phase token budget. “ER...