Training-Free Looped Transformers
Pith reviewed 2026-05-25 04:34 UTC · model grok-4.3
The pith
Treating a looped pre-norm transformer block as smaller damped sub-steps of the same forward Euler ODE approximation raises accuracy on frozen checkpoints without training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Viewing each pre-norm transformer block as a forward Euler step on an ODE allows the looped reapplication of a mid-stack block to be recast as multiple smaller damped sub-steps of the same approximation. Applied only at inference to a frozen checkpoint, this refinement raises accuracy on question-answering and knowledge benchmarks across seven model families.
What carries the argument
Damped sub-step looping of a contiguous mid-stack pre-norm transformer block, derived from its forward Euler ODE interpretation.
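One way to picture the mechanism, as a minimal sketch rather than the paper's implementation: run the layers before the window once, apply the window K times with each update scaled down, then run the remaining layers. The uniform step 1/K, the names `damped_loop` and `run_with_loop`, and the toy layers are all assumptions for illustration; the source does not say how the damping factor is chosen.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def apply_layers(layers: Sequence[Layer], x: Vector) -> Vector:
    """Apply a stack of layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def damped_loop(window: Sequence[Layer], x: Vector, num_loops: int) -> Vector:
    """Loop a window of layers, scaling each update by 1/num_loops.

    Assumption: uniform damping h = 1/num_loops (hypothetical schedule).
    With num_loops == 1 this reduces to the ordinary single pass.
    """
    step = 1.0 / num_loops
    for _ in range(num_loops):
        y = apply_layers(window, x)
        # Residual field of the window: F(x) = g(x) - x.
        x = [xi + step * (yi - xi) for xi, yi in zip(x, y)]
    return x


def run_with_loop(
    prefix: Sequence[Layer],
    window: Sequence[Layer],
    suffix: Sequence[Layer],
    x: Vector,
    num_loops: int,
) -> Vector:
    """Frozen prefix -> damped looped window -> frozen suffix."""
    x = apply_layers(prefix, x)
    x = damped_loop(window, x, num_loops)
    return apply_layers(suffix, x)


def toy_layer(scale: float) -> Layer:
    """Toy residual layer x -> x + scale * x, standing in for a frozen block."""
    return lambda x: [xi + scale * xi for xi in x]


prefix = [toy_layer(0.1)]
window = [toy_layer(-0.2), toy_layer(-0.3)]
suffix = [toy_layer(0.05)]

baseline = run_with_loop(prefix, window, suffix, [1.0, 2.0], num_loops=1)
looped = run_with_loop(prefix, window, suffix, [1.0, 2.0], num_loops=4)
```

Keeping the prefix and suffix untouched mirrors the "frozen checkpoint" constraint: only the way the window's update is applied changes, not any weights.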
If this is right
- Accuracy rises on MMLU-Pro by 2.64 points for Qwen3-4B-Instruct, on CommonsenseQA by 1.14 points for Qwen3-30B-A3B-Instruct, and on OpenBookQA by 1.20 points for Moonlight-16B-A3B-Instruct.
- The same wrapper improves results on dense, sparse MoE, and MLA+MoE architectures without any training.
- Naive block reapplication usually degrades performance, which suggests the observed benefit depends on the damping strategy rather than on looping alone.
- No continued training, fine-tuning, or architectural changes are needed.
Where Pith is reading between the lines
- The ODE framing may suggest similar inference-time step-size refinements for other sequence architectures that admit an Euler-like interpretation.
- Variable damping schedules or position-dependent loop depths could be explored as direct extensions of the same numerical view.
- If the method scales with model size, it offers a low-cost route to squeeze additional performance from already-trained checkpoints.
Load-bearing premise
A pre-norm transformer block can be meaningfully viewed as a forward Euler step on an ODE, so that replacing one large update with multiple damped sub-steps constitutes a refinement rather than an arbitrary modification.
What would settle it
Applying the damped looped updates to a held-out model family and finding no accuracy gain or a net loss relative to the single-pass baseline on standard benchmarks would falsify the central claim.
Original abstract
We introduce training-free looped transformers, in which a lightweight inference-time wrapper loops a contiguous mid-stack block of layers of a frozen checkpoint without additional fine-tuning, continued training, or architectural changes. Unlike prior looped transformer methods that train with the looped structure end-to-end, we retrofit recurrence onto pretrained models at test time. We show that naive block reapplication usually degrades performance, highlighting the importance of the loop application strategy. Motivated by viewing a pre-norm transformer block as a forward Euler step on an ODE, we instead treat looping as a refinement of the same approximation, replacing one large update with smaller damped sub-steps. Across seven dense, sparse MoE, and MLA+MoE model families, our method improves Qwen3-4B-Instruct by +2.64 pp on MMLU-Pro, Qwen3-30B-A3B-Instruct by +1.14 pp on CommonsenseQA, and Moonlight-16B-A3B-Instruct by +1.20 pp on OpenBookQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces training-free looped transformers: a lightweight inference-time wrapper that reapplies a contiguous mid-stack block of layers from a frozen pretrained checkpoint (dense, sparse MoE, or MLA+MoE) without fine-tuning or architectural modification. Naive reapplication is shown to degrade performance; the authors instead damp the looped updates, motivated by the claim that a pre-norm transformer block corresponds to a forward Euler step on an underlying ODE so that multiple damped sub-steps refine the same discretization. Empirical gains are reported across seven model families, including +2.64 pp on MMLU-Pro for Qwen3-4B-Instruct, +1.14 pp on CommonsenseQA for Qwen3-30B-A3B-Instruct, and +1.20 pp on OpenBookQA for Moonlight-16B-A3B-Instruct.
Significance. If the empirical gains prove robust and reproducible, the method would offer a practical, zero-training route to improve accuracy of existing checkpoints at inference time across multiple architectures. The absence of any parameter-free derivation, machine-checked analysis, or falsifiable prediction tied to the ODE limit, however, keeps the conceptual contribution modest even if the numbers hold.
major comments (2)
- [Abstract / Motivation] Abstract and motivation section: the claim that a pre-norm transformer block constitutes a forward Euler discretization of an ODE (so that damping turns naive looping into a refinement) is asserted without derivation of the continuous limit, without verification that the block satisfies consistency or Lipschitz conditions required for Euler convergence, and without analysis showing that the damping factor reduces local truncation error. This justification is load-bearing for the central claim that the reported gains arise from improved approximation quality rather than from effective depth, implicit regularization, or an empirical schedule.
- [Experiments] Experimental results (reported gains): the improvements (+2.64 pp, +1.14 pp, +1.20 pp) are presented without error bars, without controls that isolate the damping schedule from other looping variants, and without implementation details on how the damping factor is chosen or applied inside the block. Absent these, it is impossible to attribute success specifically to the ODE-refinement mechanism.
minor comments (1)
- [Abstract] The abstract states that seven model families were tested but lists only three concrete models; a table or explicit list of all families and checkpoints would improve clarity.
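For reference, the textbook forward Euler result that the first major comment appeals to can be written out as below. This is the standard statement for a generic vector field $F$, not an analysis taken from the manuscript. The Lipschitz constant $L$, the second-derivative bound $M$, and the step size $h = 1/K$ are assumptions, and whether a transformer block satisfies them is exactly what the referee says is unverified. The fragment assumes the `amsmath` package.

```latex
% Forward Euler on \dot{x} = F(x) with step size h, where x_n \approx x(t_n), t_n = nh.
\begin{align*}
x_{n+1} &= x_n + h\,F(x_n), \\
\tau_{n+1} &:= x(t_{n+1}) - \bigl(x(t_n) + h\,F(x(t_n))\bigr),
  \qquad \|\tau_{n+1}\| \le \tfrac{h^2}{2}\,M, \\
\|x(t_n) - x_n\| &\le \frac{hM}{2L}\,\bigl(e^{L t_n} - 1\bigr),
  \qquad \text{where } \|F(u) - F(v)\| \le L\,\|u - v\|, \quad \|\ddot{x}(t)\| \le M .
\end{align*}
% With h = 1/K and t_K = 1:
% \|x(1) - x_K\| \le \frac{M}{2LK}\,(e^{L} - 1) = O(1/K).
```

Under these assumptions, more sub-steps shrink the error at the fixed endpoint $t = 1$ at rate $1/K$; without them, the bound says nothing about the looped block.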
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments on our work. We respond to each major comment below and outline the revisions we will make to strengthen the manuscript.
- Referee: [Abstract / Motivation] Abstract and motivation section: the claim that a pre-norm transformer block constitutes a forward Euler discretization of an ODE (so that damping turns naive looping into a refinement) is asserted without derivation of the continuous limit, without verification that the block satisfies consistency or Lipschitz conditions required for Euler convergence, and without analysis showing that the damping factor reduces local truncation error. This justification is load-bearing for the central claim that the reported gains arise from improved approximation quality rather than from effective depth, implicit regularization, or an empirical schedule.
  Authors: We acknowledge that the manuscript does not provide a full derivation of the continuous limit or the required conditions for convergence. The ODE view is offered as motivation for introducing damping, drawing on the residual structure of pre-norm blocks. We will revise the abstract and motivation section to clarify that this is an intuitive analogy inspired by neural ODEs, without claiming a rigorous discretization analysis. The primary contribution remains the empirical demonstration that damped looping improves performance over naive looping across multiple models.
  Revision: partial
- Referee: [Experiments] Experimental results (reported gains): the improvements (+2.64 pp, +1.14 pp, +1.20 pp) are presented without error bars, without controls that isolate the damping schedule from other looping variants, and without implementation details on how the damping factor is chosen or applied inside the block. Absent these, it is impossible to attribute success specifically to the ODE-refinement mechanism.
  Authors: We agree that additional experimental details and controls are necessary to support the claims. In the revised manuscript, we will report error bars based on multiple evaluation runs, include ablation experiments that compare the proposed damped approach against undamped looping and alternative schedules, and provide precise implementation details on the selection and application of the damping factor. These changes will allow better attribution of the gains to the damping mechanism.
  Revision: yes
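As a sketch of what such error bars could look like, the block below computes a paired bootstrap confidence interval for the accuracy difference, in percentage points, between a looped run and the single-pass baseline on the same questions. The per-question correctness vectors are made-up toy data and `paired_bootstrap_ci` is a hypothetical helper; the source does not say which procedure the authors will use.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    baseline_correct: List[int],
    looped_correct: List[int],
    num_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (delta_pp, ci_low_pp, ci_high_pp) for looped minus baseline accuracy.

    Questions are resampled with replacement, keeping each question's
    baseline and looped outcomes paired.
    """
    if len(baseline_correct) != len(looped_correct):
        raise ValueError("both runs must cover the same questions")
    n = len(baseline_correct)
    rng = random.Random(seed)
    # Per-question difference: +1 (fixed by looping), 0, or -1 (broken by looping).
    diffs = [lp - bs for bs, lp in zip(baseline_correct, looped_correct)]
    point = 100.0 * sum(diffs) / n

    resampled = []
    for _ in range(num_resamples):
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        resampled.append(100.0 * total / n)
    resampled.sort()
    # Percentile interval from the sorted bootstrap differences.
    low = resampled[int((alpha / 2) * num_resamples)]
    high = resampled[min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))]
    return point, low, high


# Toy data for 200 questions, purely illustrative.
baseline = [1] * 120 + [0] * 80
looped = [1] * 115 + [0] * 5 + [1] * 10 + [0] * 70
delta, ci_low, ci_high = paired_bootstrap_ci(baseline, looped)
```

If the resulting interval contains zero, a gain of that size is not distinguishable from resampling noise on that benchmark, which is the concern the referee raises about the reported deltas.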
Circularity Check
No significant circularity; empirical gains are externally measured
The paper motivates looped application via an ODE forward-Euler analogy but does not derive performance improvements mathematically from that view. Instead, it applies a test-time wrapper to frozen checkpoints and reports accuracy deltas on external benchmarks (MMLU-Pro, CommonsenseQA, OpenBookQA) across multiple model families. No parameters are fitted inside the method and then renamed as predictions, no self-citations supply the load-bearing justification, and no equation reduces the claimed refinement to a definitional identity. The results remain falsifiable by independent evaluation, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J uniqueness) · echoes
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  A standard pre-norm transformer layer $L$ implements $L(x) = x + \mathrm{Attn}(\mathrm{LN}_1(x)) + \mathrm{MLP}(\mathrm{LN}_2(x + \mathrm{Attn}(\mathrm{LN}_1(x))))$. ... we define the window residual field $F_g(x) := g(x) - x$. By construction, $g(x) = x + F_g(x)$, which is exactly a forward Euler step with step size $h = 1$ on the autonomous ODE $\dot{x} = F_g(x)$.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: the relation between the paper passage and the cited Recognition theorem.
  Naively looping $g$ for $K$ rounds ... is a $K$-step forward Euler integration ... which approximates $x(t=K)$. But the post-loop layers are not trained to receive the trajectory at $t=K$ ... the principled goal of $g^{(K)}$ is therefore not to advance integration to $t=K$, but to better approximate the same endpoint $x(t=1)$.
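The contrast drawn in the second passage can be checked on a toy problem where the exact ODE solution is known. The sketch below assumes a scalar linear field F(x) = a·x, so that g(x) = (1 + a)·x; this is a stand-in, not a transformer block, and the value a = -0.5 and the uniform sub-step 1/K are assumptions for illustration.

```python
import math


def residual_field(x: float, a: float) -> float:
    """Toy window residual field F_g(x) = g(x) - x for g(x) = (1 + a) * x."""
    return a * x


def naive_loop(x: float, a: float, num_loops: int) -> float:
    """Reapply g with step size 1 each round: integrates out to t = num_loops."""
    for _ in range(num_loops):
        x = x + residual_field(x, a)
    return x


def damped_loop(x: float, a: float, num_loops: int) -> float:
    """num_loops Euler sub-steps of size 1/num_loops: still targets t = 1."""
    step = 1.0 / num_loops
    for _ in range(num_loops):
        x = x + step * residual_field(x, a)
    return x


a, x0 = -0.5, 1.0
exact_t1 = x0 * math.exp(a)  # exact solution of dx/dt = a * x at t = 1

for k in (1, 2, 4, 8):
    naive_err = abs(naive_loop(x0, a, k) - exact_t1)
    damped_err = abs(damped_loop(x0, a, k) - exact_t1)
    print(f"K={k}: naive error {naive_err:.4f}, damped error {damped_err:.4f}")
```

On this toy problem the two variants coincide at K = 1; as K grows, the damped error against the t = 1 endpoint shrinks while the naive error grows, because naive looping keeps integrating past the point the downstream layers expect.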
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.