arXiv: 2605.23872 · v1 · pith:NDJEEPKInew · submitted 2026-05-22 · 💻 cs.LG · cs.NA · math.NA · stat.ML

Training-Free Looped Transformers

Pith reviewed 2026-05-25 04:34 UTC · model grok-4.3

classification 💻 cs.LG · cs.NA · math.NA · stat.ML
keywords: looped transformers · training-free inference · ODE discretization · pre-norm blocks · inference-time adaptation · MoE models · model performance

The pith

Treating a looped pre-norm transformer block as smaller damped sub-steps of the same forward Euler ODE approximation raises accuracy on frozen checkpoints without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a lightweight inference-time wrapper can loop a contiguous mid-stack block of a pretrained transformer to improve downstream performance. Naive reapplication of the block degrades results, but damping the updates to act as refined sub-steps of the original Euler discretization reverses the loss and produces gains. These improvements appear across dense, sparse MoE, and MLA+MoE families on tasks such as MMLU-Pro, CommonsenseQA, and OpenBookQA, all without fine-tuning or architectural modification. The approach retrofits recurrence onto existing models at test time by treating the block loop as a numerical refinement rather than a new structure.

Core claim

Viewing each pre-norm transformer block as a forward Euler step on an ODE allows the looped reapplication of a mid-stack block to be recast as multiple smaller damped sub-steps of the same approximation. Applied only at inference to a frozen checkpoint, this refinement raises accuracy on question-answering and knowledge benchmarks across seven model families.
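The numerical idea behind this claim can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it replaces the transformer block with a hypothetical scalar residual field `residual_field(x) = -0.5 * x` whose ODE has a closed-form solution, and it assumes uniform damping with step size 1/K, which the summary above does not confirm. It only shows why K damped sub-steps approximate the same endpoint x(t=1) more closely than one full step, while naive reapplication integrates past it.

```python
import math

def residual_field(x: float) -> float:
    # Hypothetical stand-in for the block's residual F(x) = g(x) - x.
    return -0.5 * x

def single_pass(x: float) -> float:
    # One forward Euler step with step size h = 1: g(x) = x + F(x).
    return x + residual_field(x)

def naive_loop(x: float, K: int) -> float:
    # Reapply the full-size update K times; this integrates out to t = K.
    for _ in range(K):
        x = single_pass(x)
    return x

def damped_loop(x: float, K: int) -> float:
    # K sub-steps of size 1/K (assumed uniform damping); the target stays t = 1.
    h = 1.0 / K
    for _ in range(K):
        x = x + h * residual_field(x)
    return x

x0, K = 1.0, 4
exact = x0 * math.exp(-0.5)  # closed-form solution of dx/dt = -0.5 x at t = 1

err_single = abs(single_pass(x0) - exact)
err_damped = abs(damped_loop(x0, K) - exact)
err_naive = abs(naive_loop(x0, K) - exact)
print(f"single={err_single:.4f} damped={err_damped:.4f} naive={err_naive:.4f}")
```

With K = 1 the damped loop reduces to the ordinary single pass, so the frozen model's behaviour is recovered as a special case. Whether a more accurate integration of this kind actually helps a real checkpoint is the paper's empirical question; the toy only fixes the terminology.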

What carries the argument

Damped sub-step looping of a contiguous mid-stack pre-norm transformer block, derived from its forward Euler ODE interpretation.

If this is right

  • Accuracy rises on MMLU-Pro by 2.64 points for Qwen3-4B-Instruct, on CommonsenseQA by 1.14 points for Qwen3-30B-A3B-Instruct, and on OpenBookQA by 1.20 points for Moonlight-16B-A3B-Instruct.
  • The same wrapper improves results on dense, sparse MoE, and MLA+MoE architectures without any training.
  • Naive block reapplication degrades performance, confirming that the damping strategy is required for the observed benefit.
  • No continued training, fine-tuning, or architectural changes are needed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The ODE framing may suggest similar inference-time step-size refinements for other sequence architectures that admit an Euler-like interpretation.
  • Variable damping schedules or position-dependent loop depths could be explored as direct extensions of the same numerical view.
  • If the method scales with model size, it offers a low-cost route to squeeze additional performance from already-trained checkpoints.

Load-bearing premise

A pre-norm transformer block can be meaningfully viewed as a forward Euler step on an ODE, so that replacing one large update with multiple damped sub-steps constitutes a refinement rather than an arbitrary modification.

What would settle it

Applying the damped looped updates to a held-out model family and finding no accuracy gain or a net loss relative to the single-pass baseline on standard benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.23872 by Chen Liang, Jonathan Li, Lizhang Chen, Ni Lao, Qiang Liu.

Figure 1: Training-free looped transformer wrapper, two iteration modes. A frozen checkpoint is augmented at inference by re-applying a contiguous mid-block g = L_b ∘ ⋯ ∘ L_a for K iterations before resuming the post-loop layers. No weights are changed and no new parameters are introduced; the cost is (b − a + 1)(K − 1) extra forward passes through the loop window. (a) Block-mode iterates the whole window K times,…
Figure 2: RK integration vs. naive looping. A tiny MLP pre (ℝ⁴ → ℝ²) → block of 3 residual layers (ℝ² → ℝ²) → post (ℝ² → ℝ²) is trained end-to-end on a 2-D regression target. Each panel fixes K ∈ {2, 4, 8} and shows, over a 220×220 grid in the post-block hidden state z, the median test loss L(z) = med_i ‖post(z) − y_i‖² (log scale, colored) together with the test-set scatters obtained by naive looping (red cir…
Figure 3: Per-benchmark accuracy across three Qwen3 model variants under the training-free loop wrapper. Each panel shows baseline (striped gray) vs. our wrapper (solid blue) on 4 knowledge-MC benchmarks. The per-panel y-axis is cropped to emphasize the baseline-to-loop gap. Left: Qwen3-4B-Instruct (dense MHA) on four mid-range general benchmarks. Middle: Qwen3-4B-Base on four hard MMLU 5-shot subjects, selected fro…
Figure 4: Effect of loop count K on 16-task macro-average accuracy. (a) Ours (Algorithm 5) is stable across K ∈ {1, 2, …, 24}, while uniform loop [57] …
Figure 5: Comparison with other looped transformer methods on Llama-3.2-3B-Instruct and Moonlight-16B-A3B-Instruct. On each backbone we report three knowledge-MC benchmarks under three configurations: baseline (no loop, original checkpoint), naive loop with K=4, and ours (Algorithm 3).
3.7 Comparison with other looped transformer methods. We further isolate the contribution of our method by comparing it against two n…
Figure 6: The depth-fraction rule across nine architectures. For each checkpoint we plot the best loop-window range [a/N, b/N] as a horizontal bar with the window's center (a+b)/(2N) as a filled circle. The shaded band marks the 0.45–0.60 depth fraction, which contains the optimal window centers of 7 of 9 checkpoints (Qwen3 dense (0.6B–4B), Qwen3-30B-A3B MoE, DeepSeek-V2-Lite, Moonlight-16B-A3B, and Llama-3.2-3B), with only t…
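Two of the captions above state closed-form quantities: the extra cost (b − a + 1)(K − 1) of looping a window of layers a..b for K iterations (Figure 1), and the window center (a + b)/(2N) used in the depth-fraction rule (Figure 6). The helpers below are a small sketch of that arithmetic. The function names, the equal-cost-per-layer assumption behind `relative_overhead`, and the example configuration (N = 36, a = 16, b = 19, K = 4) are all hypothetical and are not taken from the paper.

```python
def extra_layer_applications(a: int, b: int, K: int) -> int:
    # The window holds (b - a + 1) layers; each is re-run (K - 1) additional times.
    if b < a or K < 1:
        raise ValueError("need a <= b and K >= 1")
    return (b - a + 1) * (K - 1)

def relative_overhead(a: int, b: int, K: int, num_layers: int) -> float:
    # Assumption: every layer costs the same, so overhead is a ratio of layer counts.
    return extra_layer_applications(a, b, K) / num_layers

def window_center(a: int, b: int, num_layers: int) -> float:
    # Center of the loop window as a fraction of total depth: (a + b) / (2N).
    return (a + b) / (2 * num_layers)

def center_in_band(a: int, b: int, num_layers: int,
                   lo: float = 0.45, hi: float = 0.60) -> bool:
    # True when the window center falls inside the 0.45-0.60 band from Figure 6.
    return lo <= window_center(a, b, num_layers) <= hi

# Hypothetical configuration, chosen only to exercise the formulas.
N, a, b, K = 36, 16, 19, 4
print(extra_layer_applications(a, b, K))      # 4 layers * 3 extra iterations = 12
print(round(relative_overhead(a, b, K, N), 3))
print(round(window_center(a, b, N), 3), center_in_band(a, b, N))
```

With K = 1 the extra cost is zero, which matches the single-pass baseline.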
Original abstract

We introduce training-free looped transformers, in which a lightweight inference-time wrapper loops a contiguous mid-stack block of layers of a frozen checkpoint without additional fine-tuning, continued training, or architectural changes. Unlike prior looped transformer methods that train with the looped structure end-to-end, we retrofit recurrence onto pretrained models at test time. We show that naive block reapplication usually degrades performance, highlighting the importance of the loop application strategy. Motivated by viewing a pre-norm transformer block as a forward Euler step on an ODE, we instead treat looping as a refinement of the same approximation, replacing one large update with smaller damped sub-steps. Across seven dense, sparse MoE, and MLA+MoE model families, our method improves Qwen3-4B-Instruct by +2.64 pp on MMLU-Pro, Qwen3-30B-A3B-Instruct by +1.14 pp on CommonsenseQA, and Moonlight-16B-A3B-Instruct by +1.20 pp on OpenBookQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces training-free looped transformers: a lightweight inference-time wrapper that reapplies a contiguous mid-stack block of layers from a frozen pretrained checkpoint (dense, sparse MoE, or MLA+MoE) without fine-tuning or architectural modification. Naive reapplication is shown to degrade performance; the authors instead damp the looped updates, motivated by the claim that a pre-norm transformer block corresponds to a forward Euler step on an underlying ODE so that multiple damped sub-steps refine the same discretization. Empirical gains are reported across seven model families, including +2.64 pp on MMLU-Pro for Qwen3-4B-Instruct, +1.14 pp on CommonsenseQA for Qwen3-30B-A3B-Instruct, and +1.20 pp on OpenBookQA for Moonlight-16B-A3B-Instruct.

Significance. If the empirical gains prove robust and reproducible, the method would offer a practical, zero-training route to improve accuracy of existing checkpoints at inference time across multiple architectures. The absence of any parameter-free derivation, machine-checked analysis, or falsifiable prediction tied to the ODE limit, however, keeps the conceptual contribution modest even if the numbers hold.

major comments (2)
  1. [Abstract / Motivation] Abstract and motivation section: the claim that a pre-norm transformer block constitutes a forward Euler discretization of an ODE (so that damping turns naive looping into a refinement) is asserted without derivation of the continuous limit, without verification that the block satisfies consistency or Lipschitz conditions required for Euler convergence, and without analysis showing that the damping factor reduces local truncation error. This justification is load-bearing for the central claim that the reported gains arise from improved approximation quality rather than from effective depth, implicit regularization, or an empirical schedule.
  2. [Experiments] Experimental results (reported gains): the improvements (+2.64 pp, +1.14 pp, +1.20 pp) are presented without error bars, without controls that isolate the damping schedule from other looping variants, and without implementation details on how the damping factor is chosen or applied inside the block. Absent these, it is impossible to attribute success specifically to the ODE-refinement mechanism.
minor comments (1)
  1. [Abstract] The abstract states that seven model families were tested but lists only three concrete models; a table or explicit list of all families and checkpoints would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our work. We respond to each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Motivation] Abstract and motivation section: the claim that a pre-norm transformer block constitutes a forward Euler discretization of an ODE (so that damping turns naive looping into a refinement) is asserted without derivation of the continuous limit, without verification that the block satisfies consistency or Lipschitz conditions required for Euler convergence, and without analysis showing that the damping factor reduces local truncation error. This justification is load-bearing for the central claim that the reported gains arise from improved approximation quality rather than from effective depth, implicit regularization, or an empirical schedule.

    Authors: We acknowledge that the manuscript does not provide a full derivation of the continuous limit or the required conditions for convergence. The ODE view is offered as motivation for introducing damping, drawing on the residual structure of pre-norm blocks. We will revise the abstract and motivation section to clarify that this is an intuitive analogy inspired by neural ODEs, without claiming a rigorous discretization analysis. The primary contribution remains the empirical demonstration that damped looping improves performance over naive looping across multiple models.

    Revision: partial

  2. Referee: [Experiments] Experimental results (reported gains): the improvements (+2.64 pp, +1.14 pp, +1.20 pp) are presented without error bars, without controls that isolate the damping schedule from other looping variants, and without implementation details on how the damping factor is chosen or applied inside the block. Absent these, it is impossible to attribute success specifically to the ODE-refinement mechanism.

    Authors: We agree that additional experimental details and controls are necessary to support the claims. In the revised manuscript, we will report error bars based on multiple evaluation runs, include ablation experiments that compare the proposed damped approach against undamped looping and alternative schedules, and provide precise implementation details on the selection and application of the damping factor. These changes will allow better attribution of the gains to the damping mechanism.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical gains are externally measured

full rationale

The paper motivates looped application via an ODE forward-Euler analogy but does not derive performance improvements mathematically from that view. Instead, it applies a test-time wrapper to frozen checkpoints and reports accuracy deltas on external benchmarks (MMLU-Pro, CommonsenseQA, OpenBookQA) across multiple model families. No parameters are fitted inside the method and then renamed as predictions, no self-citations supply the load-bearing justification, and no equation reduces the claimed refinement to a definitional identity. The results remain falsifiable by independent evaluation, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the ODE analogy is invoked heuristically without stated assumptions or fitted constants.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J uniqueness) echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    A standard pre-norm transformer layer L implements L(x) = x + Attn(LN₁(x)) + MLP(LN₂(x + Attn(LN₁(x)))). ... we define the window residual field F_g(x) := g(x) − x. By construction, g(x) = x + F_g(x), which is exactly a forward Euler step with step size h = 1 on the autonomous ODE ẋ = F_g(x).

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    Naively looping g for K rounds ... is a K-step forward Euler integration ... which approximates x(t=K). But the post-loop layers are not trained to receive the trajectory at t=K ... the principled goal of g(K) is therefore not to advance integration to t=K, but to better approximate the same endpoint x(t=1).
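The two quoted passages describe a concrete construction: a pre-norm layer, a window g composed of several such layers, its residual field F_g(x) = g(x) − x, and looping that targets x(t=1) rather than x(t=K). A minimal pure-Python sketch of that structure follows. The `attn` and `mlp` functions are hypothetical stand-ins (a scaled cyclic shift and a scaled tanh), not real attention or feed-forward sublayers, and the uniform 1/K sub-step in `damped_window` is an assumption; the paper's actual damping schedule is not given in the quoted text.

```python
import math
from typing import Callable, List, Sequence

Vec = List[float]
Layer = Callable[[Vec], Vec]

def layer_norm(x: Vec, eps: float = 1e-5) -> Vec:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def attn(x: Vec) -> Vec:
    # Hypothetical stand-in for attention: a cyclic shift scaled by 0.1.
    return [0.1 * x[(i + 1) % len(x)] for i in range(len(x))]

def mlp(x: Vec) -> Vec:
    # Hypothetical stand-in for the feed-forward sublayer.
    return [0.1 * math.tanh(v) for v in x]

def pre_norm_layer(x: Vec) -> Vec:
    # L(x) = x + Attn(LN1(x)) + MLP(LN2(x + Attn(LN1(x))))
    h = [p + q for p, q in zip(x, attn(layer_norm(x)))]
    return [p + q for p, q in zip(h, mlp(layer_norm(h)))]

def window(x: Vec, layers: Sequence[Layer]) -> Vec:
    # g = L_b ∘ ... ∘ L_a: apply the layers in order.
    for layer in layers:
        x = layer(x)
    return x

def residual_field(x: Vec, layers: Sequence[Layer]) -> Vec:
    # F_g(x) := g(x) - x
    return [g - v for g, v in zip(window(x, layers), x)]

def damped_window(x: Vec, layers: Sequence[Layer], K: int) -> Vec:
    # K Euler sub-steps of size 1/K on dx/dt = F_g(x) (assumed uniform damping).
    for _ in range(K):
        f = residual_field(x, layers)
        x = [v + fi / K for v, fi in zip(x, f)]
    return x

def looped_forward(x: Vec, layers: Sequence[Layer], a: int, b: int, K: int) -> Vec:
    # Layers before a run once, the window a..b (inclusive, 0-indexed here) is
    # refined with K sub-steps, and the post-loop layers run once.
    x = window(x, layers[:a])
    x = damped_window(x, layers[a:b + 1], K)
    return window(x, layers[b + 1:])

x = [1.0, -2.0, 0.5, 3.0]
stack = [pre_norm_layer] * 6
print(looped_forward(x, stack, a=2, b=3, K=4))
```

Each sub-step evaluates the whole window once, so K sub-steps cost K window passes, in line with the (b − a + 1)(K − 1) extra passes stated in Figure 1. Setting K = 1 reproduces the ordinary forward pass up to floating-point rounding.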

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · 19 internal anchors

  1. [1]

    Transformers learn to implement preconditioned gradient descent for in-context learning

    Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023

  2. [2]

    Alexander C. Aitken. On Bernoulli’s numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289–305, 1927

  3. [3]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021

  4. [4]

    Relaxed recursive transformers: Effective parameter sharing with layer-wise lora

    Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025

  5. [5]

    Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation

    Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron C. Courville, and Se-Young Yun. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. In Advances in Neural Information Processing Systems 39: Annual Conference on Neural Information Processing...

  6. [6]

    Deep equilibrium models

    Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pages 688–699, 2019

  7. [7]

    Multiscale deep equilibrium models

    Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Multiscale deep equilibrium models. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 2020

  8. [8]

    Pondernet: Learning to ponder

    Andrea Banino, Jan Balaguer, and Charles Blundell. Pondernet: Learning to ponder. CoRR, abs/2107.05407, 2021

  9. [9]

    End-to-end algorithm synthesis with recurrent networks: Extrapolation without overthinking

    Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum, and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Extrapolation without overthinking. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022

  10. [10]

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. CoRR, abs/2303.08112, 2023

  11. [11]

    Numerical Methods for Ordinary Differential Equations

    John C. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, 3rd edition, 2016

  12. [12]

    Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent

    Bo Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, and Zhao Song. Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent. In International Conference on Artificial Intelligence and Statistics, AISTATS 2025, Proceedings of Machine Learning Research, pages 4447–4455, 2025

  13. [13]

    Loop as a bridge: Can looped transformers truly link representation space and natural language outputs?

    Guanxu Chen, Dongrui Liu, and Jing Shao. Loop as a bridge: Can looped transformers truly link representation space and natural language outputs? CoRR, abs/2601.10242, 2026

  14. [14]

    Thinking deeper, not longer: Depth-recurrent transformers for compositional generalization

    Hung-Hsuan Chen. Thinking deeper, not longer: Depth-recurrent transformers for compositional generalization. CoRR, abs/2603.21676, 2026

  15. [15]

    Demystifying LION: a Hamiltonian perspective

    Lizhang Chen. Demystifying LION: a Hamiltonian perspective. Master’s thesis, The University of Texas at Austin, 2025

  16. [16]

    Cautious weight decay

    Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, and Qiang Liu. Cautious weight decay. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026

  17. [17]

    Muon optimizes under spectral norm constraints

    Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. Trans. Mach. Learn. Res., 2026, 2026

  18. [18]

    ϕ-balancing for mixture-of-experts training

    Lizhang Chen, Jonathan Li, Qi Wang, Runlong Liao, Shuozhe Li, Chen Liang, Ni Lao, and Qiang Liu. ϕ-balancing for mixture-of-experts training. In Forty-third International Conference on Machine Learning, ICML 2026, 2026

  19. [19]

    Lion secretly solves a constrained optimization: As Lyapunov predicts

    Lizhang Chen, Bo Liu, Kaizhao Liang, and Qiang Liu. Lion secretly solves a constrained optimization: As Lyapunov predicts. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024

  20. [20]

    Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

    Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations. CoRR, abs/2412.13171, 2024

  21. [21]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018

  22. [22]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021

  23. [23]

    Moeut: Mixture-of-experts universal transformers

    Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, and Christopher D. Manning. Moeut: Mixture-of-experts universal transformers. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, 2024

  24. [24]

    Simulation of graph algorithms with looped transformers

    Artur Back de Luca and Kimon Fountoulakis. Simulation of graph algorithms with looped transformers. In Forty-first International Conference on Machine Learning, ICML 2024, Proceedings of Machine Learning Research, pages 2319–2363, 2024

  25. [25]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024

  26. [26]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. CoRR, abs/2501.12948, 2025

  27. [27]

    Universal transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In 7th International Conference on Learning Representations, ICLR 2019, 2019

  28. [28]

    Looped transformers for length generalization

    Ying Fan, Yilun Du, Kannan Ramchandran, and Kangwook Lee. Looped transformers for length generalization. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025

  29. [29]

    Towards revealing the mystery behind chain of thought: A theoretical perspective

    Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023

  30. [30]

    A framework for few-shot language model evaluation

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

  31. [31]

    Algoformer: An efficient transformer framework with algorithmic structures

    Yihang Gao, Chuanyang Zheng, Enze Xie, Han Shi, Tianyang Hu, Yu Li, Michael Ng, Zhenguo Li, and Zhaoqiang Liu. Algoformer: An efficient transformer framework with algorithmic structures. Trans. Mach. Learn. Res., 2025, 2025

  32. [32]

    Can looped transformers learn to implement multi-step gradient descent for in-context learning?

    Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, and Sanjiv Kumar. Can looped transformers learn to implement multi-step gradient descent for in-context learning? In Forty-first International Conference on Machine Learning, ICML 2024, Proceedings of Machine Learning Research, pages 15130–15152, 2024

  33. [33]

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. CoRR, abs/2502.05171, 2025

  34. [34]

    Efficient parallel samplers for recurrent-depth models and their connection to diffusion language models

    Jonas Geiping, Xinyu Yang, and Guinan Su. Efficient parallel samplers for recurrent-depth models and their connection to diffusion language models. CoRR, abs/2510.14961, 2025

  35. [35]

    Looped transformers as programmable computers

    Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, ICML 2023, Proceedings of Machine Learning Research, pages 11398–11442, 2023

  36. [36]

    What makes looped transformers perform better than non-recursive ones (provably)

    Zixuan Gong, Jiaye Teng, and Yong Liu. What makes looped transformers perform better than non-recursive ones (provably). CoRR, abs/2510.10089, 2025

  37. [37]

    Think before you speak: Training language models with pause tokens

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024

  38. [38]

    Adaptive Computation Time for Recurrent Neural Networks

    Alex Graves. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983, 2016

  39. [39]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. CoRR, abs/2412.06769, 2024

  40. [40]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, 2021

  41. [41]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Proces...

  42. [42]

    Loopformer: Elastic-depth looped transformers for latent reasoning via shortcut modulation

    Ahmadreza Jeddi, Marco Ciccone, and Babak Taati. Loopformer: Elastic-depth looped transformers for latent reasoning via shortcut modulation. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026

  43. [43]

    Depth-recurrent attention mixtures: Giving latent reasoning the attention it deserves

    Jonas Knupp, Jan Hendrik Metzen, Jeremias Bohn, Georg Groh, and Kristian Kersting. Depth-recurrent attention mixtures: Giving latent reasoning the attention it deserves. CoRR, abs/2601.21582, 2026

  44. [44]

    Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

    Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, and Yuekun Yao. Loop, think, & generalize: Implicit reasoning in recurrent-depth transformers. CoRR, abs/2604.07822, 2026

  45. [45]

    Encode, think, decode: Scaling test-time reasoning with recursive latent thoughts

    Yeskendir Koishekenov, Aldo Lipani, and Nicola Cancedda. Encode, think, decode: Scaling test-time reasoning with recursive latent thoughts. CoRR, abs/2510.07358, 2025

  46. [46]

    The remarkable robustness of llms: Stages of inference?

    Vedang Lad, Wes Gurnee, and Max Tegmark. The remarkable robustness of llms: Stages of inference? CoRR, abs/2406.19384, 2024

  47. [47]

    ALBERT: A lite BERT for self-supervised learning of language representations

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations, ICLR 2020, 2020

  48. [48]

    Learning robust reasoning through guided adversarial self-play

    Shuozhe Li, Vaishnav Tadiparthi, Kwonjoon Lee, Nakul Agarwal, Hossein Nourkhiz Mahjoub, Ehsan Moradi-Pari, Lizhang Chen, Amy Zhang, and Liu Leqi. Learning robust reasoning through guided adversarial self-play. CoRR, abs/2602.00173, 2026

  49. [49]

    Do latent-cot models think step-by-step? A mechanistic study on sequential reasoning tasks

    Jia Liang and Liangming Pan. Do latent-cot models think step-by-step? A mechanistic study on sequential reasoning tasks. CoRR, abs/2602.00449, 2026

  50. [50]

    Cautious optimizers: Improving training with one line of code

    Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026

  51. [51]

    Memory-efficient LLM training with online subspace descent

    Kaizhao Liang, Bo Liu, Lizhang Chen, and Qiang Liu. Memory-efficient LLM training with online subspace descent. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, 2024

  52. [52]

    Momentum guidance: Plug-and-play guidance for flow models

    Runlong Liao, Jian Yu, Baiyu Su, Chi Zhang, Lizhang Chen, and Qiang Liu. Momentum guidance: Plug-and-play guidance for flow models. CoRR, abs/2602.20360, 2026

  53. [53]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, pages 3214–3252, 2022

  54. [54]

    Communication efficient distributed training with distributed Lion

    Bo Liu, Lemeng Wu, Lizhang Chen, Kaizhao Liang, Jiaxu Zhu, Chen Liang, Raghuraman Krishnamoorthi, and Qiang Liu. Communication efficient distributed training with distributed Lion. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, 2024

  55. [55]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is sca...

  56. [56]

    Latent chain-of-thought? decoding the depth-recurrent transformer

    Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, and Enqi Liu. Latent chain-of-thought? decoding the depth-recurrent transformer. CoRR, abs/2507.02199, 2025

  57. [57]

    Inner loop inference for pretrained transformers: Unlocking latent capabilities without training

    Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, and Ghouthi Boukli Hacene. Inner loop inference for pretrained transformers: Unlocking latent capabilities without training. CoRR, abs/2602.14759, 2026

  58. [58]

    One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention

    Arvind V. Mahankali, Tatsunori Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024

  59. [59]

    Shortgpt: Layers in large language models are more redundant than you expect

    Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. In Findings of the Association for Computational Linguistics, ACL 2025, Findings of ACL, pages 20192–20204, 2025

  60. [60]

    The expressive power of transformers with chain of thought

    William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024

  61. [61]

    A little depth goes a long way: The expressive power of log-depth transformers

    William Merrill and Ashish Sabharwal. A little depth goes a long way: The expressive power of log-depth transformers. CoRR, abs/2503.03961, 2025

  62. [62]

    Can a suit of armor conduct electricity? A new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018

  63. [63]

    Cotformer: A chain of thought driven architecture with budget-adaptive computation cost at inference

    Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain of thought driven architecture with budget-adaptive computation cost at inference. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025

  64. [64]

    Memory-efficient optimization with factorized Hamiltonian descent

    Son Nguyen, Lizhang Chen, Bo Liu, and Qiang Liu. Memory-efficient optimization with factorized Hamiltonian descent. In International Conference on Artificial Intelligence and Statistics, AISTATS 2025, Proceedings of Machine Learning Research, pages 2863–2871, 2025

  65. [65]

    Improving adaptive moment optimization via preconditioner diagonalization

    Son Nguyen, Bo Liu, Lizhang Chen, and Qiang Liu. Improving adaptive moment optimization via preconditioner diagonalization. In International Conference on Artificial Intelligence and Statistics, AISTATS 2026, 2026

  66. [66]

    Improving recursive transformers with mixture of loras

    Mohammadmahdi Nouriborji, Morteza Rohanian, and Omid Rohanian. Improving recursive transformers with mixture of loras. CoRR, abs/2512.12880, 2025

  67. [67]

    Costin-Andrei Oncescu, Depen Morwani, Samy Jelassi, Alexandru Meterez, Mujin Kwun, and Sham M. Kakade. The recurrent transformer: Greater effective depth and efficient decoding. CoRR, abs/2604.21215, 2026

  68. [68]

    Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, CHIL 2022, Proceedings of Machine Learning Research, pages 248–260, 2022

  69. [69]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 1: Long Papers, 2016

  70. [70]

    Two-scale latent dynamics for recurrent-depth transformers

    Francesco Pappone, Donato Crisostomi, and Emanuele Rodolà. Two-scale latent dynamics for recurrent-depth transformers. CoRR, abs/2509.23314, 2025

  71. [71]

    DeMo: Decoupled momentum optimization

    Bowen Peng, Lizhang Chen, Baiyu Su, Jeffrey Quesnelle, Diederik P. Kingma, and Qiang Liu. DeMo: Decoupled momentum optimization. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026

  72. [72]

    Jacob Pfau, William Merrill, and Samuel R. Bowman. Let’s think dot by dot: Hidden computation in transformer language models. CoRR, abs/2404.15758, 2024

  73. [73]

    Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964

  74. [74]

    Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y. Fu. Parcae: Scaling laws for stable looped language models. CoRR, abs/2604.12946, 2026

  75. [75]

    Subformer: Exploring weight sharing for parameter efficiency in generative transformers

    Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. Subformer: Exploring weight sharing for parameter efficiency in generative transformers. In Findings of the Association for Computational Linguistics: EMNLP 2021, Findings of ACL, pages 4081–4090, 2021

  76. [76]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024

  77. [77]

    Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025

  78. [78]

    Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks

    Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, pages 6695–6706, 2021

  79. [79]

    The curious case of AdamW

    Baiyu Su, Lizhang Chen, and Qiang Liu. The curious case of AdamW, 2026

  80. [80]

    Lessons on parameter sharing across layers in transformers

    Sho Takase and Shun Kiyono. Lessons on parameter sharing across layers in transformers. In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing, SustaiNLP 2023, pages 78–90, 2023

Showing first 80 references.