pith. sign in

arxiv: 2605.26099 · v3 · pith:QRAORAQWnew · submitted 2026-05-25 · 💻 cs.CL · cs.AI

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

Pith reviewed 2026-06-29 21:35 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords language modelsoffline recurrencefast weightsstate-space modelslong-context inferencesleep-like consolidationreasoning tasks
0
0 comments X

The pith

Language models improve online inference by consolidating context into fast weights during periodic offline sleep phases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models can handle long-horizon tasks better by adding a sleep-like process that shifts computation offline. During sleep the model runs N recurrent passes over recent context and updates fast weights in its state-space model blocks using a learned local rule, then clears the key-value cache. This preserves wake-time latency while the authors report that longer sleep improves accuracy, with the biggest lifts on examples that demand deeper reasoning. The approach is evaluated on cellular automata, multi-hop graph retrieval, and a math reasoning task where standard transformers and SSM hybrids fall short.

Core claim

Performing N offline recurrent passes over accumulated context and updating fast weights in SSM blocks via a learned local rule consolidates information in a way that improves subsequent online inference, with gains scaling with sleep duration N and concentrated on harder reasoning examples.

What carries the argument

The offline recurrence sleep mechanism that converts recent context into persistent fast weights before clearing the key-value cache.

If this is right

  • Increasing sleep duration N produces measurable performance gains on the tested tasks.
  • Gains are largest on examples that require deeper reasoning.
  • The method succeeds on cellular automata, multi-hop graph retrieval, and math reasoning where regular transformers and SSM-attention hybrids fail.
  • Extra computation is shifted to sleep periods while online prediction latency remains unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consolidation step could be inserted into other hybrid architectures that already contain SSM blocks.
  • If the learned local update rule generalizes, it might reduce reliance on ever-longer context windows during training.
  • The offline phase resembles a form of experience replay that could be combined with existing memory-augmented models.

Load-bearing premise

Performing N offline recurrent passes over accumulated context and updating fast weights in SSM blocks via a learned local rule will consolidate information in a way that improves subsequent online inference.

What would settle it

If increasing sleep duration N produces no performance gain on the math reasoning task or if the gains are not larger on deeper-reasoning examples, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.26099 by Giulia Fanti, Sangyun Lee, Sean McLeish, Tom Goldstein.

Figure 1
Figure 1. Figure 1: At the eviction boundary, an SSM-attention hybrid performs [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Increasing N improves performance on cellular automaton. Left: Each curve represents a different number of rollout steps t for a hybrid attention-SSM architecture, as in the motivating example section. Increasing t makes the task harder for a vanilla attention-GDN hybrid model. We early-stop 4- and 8-step runs as they converge earlier. Right: For a challenging reasoning task (t = 32), additional offline sl… view at source ↗
Figure 3
Figure 3. Figure 3: Increasing N improves performance on Depo. Test loss of a 4-layer GDN-attention hybrid on the k-hop knowledge retrieval task. Additional offline loops accelerate learning, especially for more reasoning-intensive, higher-hop queries. Here | denotes an eviction boundary, and red text denotes answer tokens. In our setting, each cycle contains up to 75 nodes and spans up to 300 tokens; shorter instances are le… view at source ↗
Figure 4
Figure 4. Figure 4: Increasing N improves performance on GSM-Infinite. GSM-Infinite accuracy over training steps. Subplots group examples by the number of arithmetic operations required by the problem, and colors indicate the number of offline loops N used before cache eviction. Additional loops improve accuracy most clearly on harder problems with more operations, where single-loop models have less sleep-time computation ava… view at source ↗
Figure 5
Figure 5. Figure 5: Increasing N improves accuracy on GSM-Infinite with sliding-window eviction. GSM￾Infinite accuracy with sliding-window eviction over training steps. We fine-tune Ouro 1.4B with window size L = 512 and compare N ∈ {1, 2, 4} sleep passes. inference-time memory: the active context is still capped at L tokens, as in sliding-window attention (SWA). With N = 1, this reduces to a standard SWA-SSM hybrid model [48… view at source ↗
Figure 6
Figure 6. Figure 6: Recurrence across context windows incur minimal training overhead; recurrent-depth linearly increases cost. Training throughput comparison on 1 NVIDIA H200 GPU. Sequence length is set to 12, 000. (a) When window size L is sufficiently large, serialness across context windows do not meaningfully change the throughput compared to the fully parallel baseline. (b) Throughput is roughly inversely proportional t… view at source ↗
read the original abstract

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs $N$ offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration $N$ for our models improves performance, with the largest gains on examples that require deeper reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that incorporating a sleep-like consolidation phase, where the model performs N offline recurrent passes over context to update fast weights in SSM blocks using a learned local rule, allows for improved performance on long-horizon tasks during online inference. This is evaluated on synthetic tasks such as cellular automata and multi-hop graph retrieval, as well as a math reasoning task, demonstrating that larger N leads to better results, with the most significant improvements on examples requiring deeper reasoning where standard models fail.

Significance. If the results are robust, this method represents a meaningful advance in handling long contexts by moving computation to an offline "sleep" phase, preserving online latency. The manuscript's strength lies in its use of controlled synthetic tasks with ablations on sleep duration N and task difficulty, providing evidence for the consolidation mechanism. The empirical nature of the work, with comparisons to failing baselines, is noted positively. The stress-test concern regarding whether offline passes consolidate information to improve inference does not appear to land as a flaw, given the reported controlled comparisons and ablations.

minor comments (3)
  1. [Abstract] Consider adding a sentence summarizing the magnitude of performance gains with increasing N to better highlight the key finding.
  2. [Method] Provide a clearer definition or equation for the learned local rule used to update the fast weights in the SSM blocks.
  3. [Experiments] Include error bars or standard deviations in the reported results for different values of N to allow assessment of variability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, accurate summary of the proposed sleep-like consolidation mechanism, and recommendation for minor revision. We appreciate the recognition of the controlled synthetic tasks, ablations on sleep duration N, and comparisons to failing baselines.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical method for offline recurrence in SSM-augmented models, with performance gains demonstrated via controlled experiments on synthetic and math tasks as N increases. No equations, fitted parameters, or self-citations are shown to reduce the central claim (improved inference via sleep consolidation) to a definitional tautology or input by construction. The reported scaling with N is an observed outcome rather than a renamed fit or self-referential prediction. The derivation chain is self-contained against external benchmarks and does not invoke load-bearing self-citations or ansatzes that collapse the result.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The abstract introduces a new consolidation procedure whose effectiveness rests on an unstated learned local rule and the assumption that SSM blocks can usefully store consolidated context; no independent evidence or derivations are supplied.

free parameters (1)
  • N
    Number of offline recurrent passes; performance is reported to increase with larger N
axioms (1)
  • domain assumption A learned local rule can update fast weights to consolidate context during offline passes
    Invoked in the description of the sleep phase
invented entities (1)
  • fast weights in SSM blocks no independent evidence
    purpose: Persistent storage of consolidated context after cache clearing
    New component introduced to enable the sleep mechanism

pith-pipeline@v0.9.1-grok · 5685 in / 1232 out tokens · 48936 ms · 2026-06-29T21:35:25.348546+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 36 canonical work pages · 17 internal anchors

  1. [1]

    Physics of language models: Part 4.1, architecture design and the magic of canon layers

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 4.1, architecture design and the magic of canon layers. InThe Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=kxv0M6I7Ud

  2. [2]

    Simple linear attention language models balance the recall-throughput tradeoff.arXiv preprint arXiv:2402.18668, 2024

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff.arXiv preprint arXiv:2402.18668, 2024

  3. [3]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a labora...

  4. [4]

    Deep equilibrium models.Advances in neural information processing systems, 32, 2019

    Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models.Advances in neural information processing systems, 32, 2019

  5. [5]

    End-to-end algorithm synthesis with recurrent networks: Extrapolation without overthinking.Advances in Neural Information Processing Systems, 35:20232–20242, 2022

    Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum, and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Extrapolation without overthinking.Advances in Neural Information Processing Systems, 35:20232–20242, 2022

  6. [6]

    Language models need sleep: Learning to self modify and consolidate memories

    Ali Behrouz, Farnoosh Hashemi, and Vahab Mirrokni. Language models need sleep: Learning to self modify and consolidate memories

  7. [7]

    Transformers to ssms: Dis- tilling quadratic knowledge to subquadratic models.Advances in neural information processing systems, 37:31788–31812, 2024

    Aviv Bick, Kevin Y Li, Eric P Xing, J Zico Kolter, and Albert Gu. Transformers to ssms: Dis- tilling quadratic knowledge to subquadratic models.Advances in neural information processing systems, 37:31788–31812, 2024

  8. [8]

    Short window attention enables long-term memorization

    Loïc Cabannes, Maximilian Beck, Gergely Szilvasy, Matthijs Douze, Maria Lomeli, Jade Copet, Pierre-Emmanuel Mazaré, Gabriel Synnaeve, and Hervé Jégou. Short window attention enables long-term memorization.arXiv preprint arXiv:2509.24552, 2025

  9. [9]

    Training plug-n-play knowledge modules with deep context distillation, 2025

    Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vuli ´c, and Alessandro Sordoni. Training plug-n-play knowledge modules with deep context distillation, 2025. URL https://arxiv. org/abs/2503.08727

  10. [10]

    Infiniteicl: Breaking the limit of context window size via long short-term memory transformation, 2025

    Bowen Cao, Deng Cai, and Wai Lam. Infiniteicl: Breaking the limit of context window size via long short-term memory transformation, 2025. URL https://arxiv.org/abs/2504.01707

  11. [11]

    On contrastive divergence learning

    Miguel A Carreira-Perpinan and Geoffrey Hinton. On contrastive divergence learning. In International workshop on artificial intelligence and statistics, pages 33–40. PMLR, 2005. 12

  12. [12]

    Meta-reinforcement learning with self-modifying networks.Advances in Neural Information Processing Systems, 35:7838–7851, 2022

    Mathieu Chalvidal, Thomas Serre, and Rufin VanRullen. Meta-reinforcement learning with self-modifying networks.Advances in Neural Information Processing Systems, 35:7838–7851, 2022

  13. [13]

    Generative adapter: Contextualizing language models in parameters with a single forward pass, 2024

    Tong Chen, Hao Fang, Patrick Xia, Xiaodong Liu, Benjamin Van Durme, Luke Zettlemoyer, Jianfeng Gao, and Hao Cheng. Generative adapter: Contextualizing language models in parameters with a single forward pass, 2024. URLhttps://arxiv.org/abs/2411.05877

  14. [14]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  15. [15]

    Universality in elementary cellular automata.Complex systems, 15(1): 1–40, 2004

    Matthew Cook et al. Universality in elementary cellular automata.Complex systems, 15(1): 1–40, 2004

  16. [16]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, volume 2024, pages 35549–35562, 2024

  17. [17]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024

  18. [18]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427, 2024

  19. [19]

    Universal Transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- versal transformers.arXiv preprint arXiv:1807.03819, 2018

  20. [20]

    Hymba: A hybrid-head architecture for small language models.arXiv preprint arXiv:2411.13676, 2024

    Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabalesh- warkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. Hymba: A hybrid-head architecture for small language models.arXiv preprint arXiv:2411.13676, 2024

  21. [21]

    Depth-adaptive transformer.arXiv preprint arXiv:1910.10073, 2019

    Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer.arXiv preprint arXiv:1910.10073, 2019

  22. [22]

    Cartridges: Lightweight and general- purpose long context representations via self-study.arXiv preprint arXiv:2506.06266, 2025

    Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, et al. Cartridges: Lightweight and general- purpose long context representations via self-study.arXiv preprint arXiv:2506.06266, 2025

  23. [23]

    In-context autoencoder for context compression in a large language model.arXiv preprint arXiv:2307.06945, 2023

    Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model.arXiv preprint arXiv:2307.06945, 2023

  24. [24]

    Scaling up test-time compute with latent reasoning: A recurrent depth approach

    Jonas Geiping, Sean Michael McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. InNeurIPS, 2025

  25. [25]

    Adaptive Computation Time for Recurrent Neural Networks

    Alex Graves. Adaptive computation time for recurrent neural networks.arXiv preprint arXiv:1603.08983, 2016

  26. [26]

    Jet-Nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884, 2025

    Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, and Han Cai. Jet-Nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884, 2025

  27. [27]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

  28. [28]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

  29. [29]

    Psychology press, 2005

    Donald Olding Hebb.The organization of behavior: A neuropsychological theory. Psychology press, 2005. 13

  30. [30]

    wake-sleep

    Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The" wake-sleep" algorithm for unsupervised neural networks.Science, 268(5214):1158–1161, 1995

  31. [31]

    RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

  32. [32]

    Repeat after me: Transformers are better than state space models at copying.arXiv preprint arXiv:2402.01032, 2024

    Samy Jelassi, David Brandfonbrener, Sham M Kakade, and Eran Malach. Repeat after me: Transformers are better than state space models at copying.arXiv preprint arXiv:2402.01032, 2024

  33. [33]

    Luo, Carla P

    Anmol Kabra, Yilun Yin, Albert Gong, Kamile Stankeviciute, Dongyoung Go, Johann Lee, Katie Z. Luo, Carla P. Gomes, and Kilian Q. Weinberger. Learning from synthetic data improves multi-hop reasoning. InInternational Conference on Learning Representations, 2026

  34. [34]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pages 5156–5165. PMLR, 2020

  35. [35]

    Sleep-time compute: Beyond inference scaling at test-time

    Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, and Joseph E Gonzalez. Sleep-time compute: Beyond inference scaling at test-time.arXiv preprint arXiv:2504.13171, 2025

  36. [36]

    Transformers learn shortcuts to automata.arXiv preprint arXiv:2210.10749, 2022

    Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata.arXiv preprint arXiv:2210.10749, 2022

  37. [37]

    The Serial Scaling Hypothesis

    Yuxi Liu, Konpat Preechakul, Kananart Kuwaranancharoen, and Yutong Bai. The serial scaling hypothesis.arXiv preprint arXiv:2507.12549, 2025

  38. [38]

    James L McClelland, Bruce L McNaughton, and Randall C O’Reilly. Why there are comple- mentary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory.Psychological review, 102(3):419, 1995

  39. [39]

    Teach- ing pretrained language models to think deeper with retrofitted recurrence.arXiv preprint arXiv:2511.07384, 2025

    Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, and Micah Goldblum. Teach- ing pretrained language models to think deeper with retrofitted recurrence.arXiv preprint arXiv:2511.07384, 2025

  40. [40]

    A little depth goes a long way: The expressive power of log-depth transformers.arXiv preprint arXiv:2503.03961, 2025

    William Merrill and Ashish Sabharwal. A little depth goes a long way: The expressive power of log-depth transformers.arXiv preprint arXiv:2503.03961, 2025

  41. [41]

    Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

  42. [42]

    Offline replay supports planning in human reinforcement learning.elife, 7:e32548, 2018

    Ida Momennejad, A Ross Otto, Nathaniel D Daw, and Kenneth A Norman. Offline replay supports planning in human reinforcement learning.elife, 7:e32548, 2018

  43. [43]

    P-completeness of cellular automaton rule 110

    Turlough Neary and Damien Woods. P-completeness of cellular automaton rule 110. In International Colloquium on Automata, Languages, and Programming, pages 132–143. Springer, 2006

  44. [44]

    Deep sequence models tend to memorize geometrically; it is unclear why

    Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, and Sanjiv Kumar. Deep sequence models tend to memorize geometrically; it is unclear why.arXiv preprint arXiv:2510.26745, 2025

  45. [45]

    NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

    NVIDIA. NVIDIA Nemotron Nano 2: An accurate and efficient hybrid Mamba-Transformer reasoning model.arXiv preprint arXiv:2508.14444, 2025

  46. [46]

    Parcae: Scaling Laws For Stable Looped Language Models

    Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y Fu. Parcae: Scaling laws for stable looped language models.arXiv preprint arXiv:2604.12946, 2026

  47. [47]

    About sleep’s role in memory.Physiological reviews, 2013

    Björn Rasch and Jan Born. About sleep’s role in memory.Physiological reviews, 2013. 14

  48. [48]

    Samba: Simple hybrid state space models for efficient unlimited context language modeling.arXiv preprint arXiv:2406.07522, 2024

    Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling.arXiv preprint arXiv:2406.07522, 2024

  49. [49]

    Linear transformers are secretly fast weight programmers

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. InInternational conference on machine learning, pages 9355–9366. PMLR, 2021

  50. [50]

    Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks.Advances in Neural Information Processing Systems, 34:6695–6706, 2021

    Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks.Advances in Neural Information Processing Systems, 34:6695–6706, 2021

  51. [51]

    How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

    Kristian Schwethelm, Daniel Rueckert, and Georgios Kaissis. How much is one recurrence worth? iso-depth scaling laws for looped language models.arXiv preprint arXiv:2604.21106, 2026

  52. [52]

    Learning by distilling context, 2022

    Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context, 2022. URL https://arxiv.org/abs/2209.15189

  53. [53]

    Dyna, an integrated architecture for learning, planning, and reacting.ACM Sigart Bulletin, 2(4):160–163, 1991

    Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting.ACM Sigart Bulletin, 2(4):160–163, 1991

  54. [54]

    Online adaptation of language models with a memory of amortized contexts, 2024

    Jihoon Tack, Jaehyung Kim, Eric Mitchell, Jinwoo Shin, Yee Whye Teh, and Jonathan Richard Schwarz. Online adaptation of language models with a memory of amortized contexts, 2024. URLhttps://arxiv.org/abs/2403.04317

  55. [55]

    End-to-end test-time training for long context.arXiv preprint arXiv:2512.23675,

    Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, et al. End-to-end test-time training for long context.arXiv preprint arXiv:2512.23675, 2025

  56. [56]

    Qwen3.5: Accelerating productivity with native multimodal agents, February

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February

  57. [57]

    URLhttps://qwen.ai/blog?id=qwen3.5

  58. [58]

    Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

    Edward C Tolman. Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

  59. [59]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Informa- tion Processing Systems, 2017

  60. [60]

    The mamba in the llama: Distilling and accelerating hybrid models.Advances in Neural Information Processing Systems, 37:62432–62457, 2024

    Junxiong Wang, Daniele Paliotta, Avner May, Alexander M Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models.Advances in Neural Information Processing Systems, 37:62432–62457, 2024

  61. [61]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training.arXiv preprint arXiv:2312.06635, 2023

  62. [62]

    Gated Delta Networks: Improving Mamba2 with Delta Rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule.arXiv preprint arXiv:2412.06464, 2024

  63. [63]

    Parallelizing linear transformers with the delta rule over sequence length.Advances in neural information processing systems, 37:115491–115522, 2024

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length.Advances in neural information processing systems, 37:115491–115522, 2024

  64. [64]

    Train- ing large reasoning models efficiently via progressive thought encoding.arXiv preprint arXiv:2602.16839, 2026

    Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu, and Jianfeng Gao. Train- ing large reasoning models efficiently via progressive thought encoding.arXiv preprint arXiv:2602.16839, 2026

  65. [65]

    GSM-Infinite: How do your LLMs behave over infinitely increasing reasoning complexity and context length? In ICML 2025 Workshop on Long-Context Foundation Models, 2025

    Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, and Beidi Chen. GSM-Infinite: How do your LLMs behave over infinitely increasing reasoning complexity and context length? In ICML 2025 Workshop on Long-Context Foundation Models, 2025

  66. [66]

    Scaling Latent Reasoning via Looped Language Models

    Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, et al. Scaling latent reasoning via looped language models.arXiv preprint arXiv:2510.25741, 2025. 15