Training-Free Looped Transformers

Chen Liang; Jonathan Li; Lizhang Chen; Ni Lao; Qiang Liu

arXiv: 2605.23872 · v1 · pith:NDJEEPKInew · submitted 2026-05-22 · cs.LG · cs.NA · math.NA · stat.ML


Pith reviewed 2026-05-25 04:34 UTC · model grok-4.3

classification: cs.LG, cs.NA, math.NA, stat.ML

keywords: looped transformers, training-free inference, ODE discretization, pre-norm blocks, inference-time adaptation, MoE models, model performance

The pith

Treating a looped pre-norm transformer block as smaller damped sub-steps of the same forward Euler ODE approximation raises accuracy on frozen checkpoints without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a lightweight inference-time wrapper can loop a contiguous mid-stack block of a pretrained transformer to improve downstream performance. Naive reapplication of the block degrades results, but damping the updates to act as refined sub-steps of the original Euler discretization reverses the loss and produces gains. These improvements appear across dense, sparse MoE, and MLA+MoE families on tasks such as MMLU-Pro, CommonsenseQA, and OpenBookQA, all without fine-tuning or architectural modification. The approach retrofits recurrence onto existing models at test time by treating the block loop as a numerical refinement rather than a new structure.

Core claim

Viewing each pre-norm transformer block as a forward Euler step on an ODE allows the looped reapplication of a mid-stack block to be recast as multiple smaller damped sub-steps of the same approximation. Applied only at inference to a frozen checkpoint, this refinement raises accuracy on question-answering and knowledge benchmarks across seven model families.

What carries the argument

Damped sub-step looping of a contiguous mid-stack pre-norm transformer block, derived from its forward Euler ODE interpretation.
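
A minimal sketch of the contrast between naive and damped looping, assuming the damped variant takes K uniform sub-steps of size 1/K along the residual g(x) − x. The uniform step size, the names `naive_loop`, `damped_loop`, and `toy_block`, and the scalar toy map are illustrative assumptions, not the paper's algorithm, which may use a different damping schedule.

```python
import math


def naive_loop(g, x, K):
    """Reapply the block g to x K times (undamped looping)."""
    for _ in range(K):
        x = g(x)
    return x


def damped_loop(g, x, K):
    """Take K sub-steps of size 1/K along the residual F_g(x) = g(x) - x.

    Assumption: a uniform step size h = 1/K. With K = 1 this reduces to a
    single ordinary application of g.
    """
    h = 1.0 / K
    for _ in range(K):
        x = x + h * (g(x) - x)
    return x


def toy_block(x):
    # Hypothetical residual map x -> x + F(x) with F(x) = -1.5 * tanh(x).
    # It stands in for a frozen loop window and is not a transformer layer.
    return x - 1.5 * math.tanh(x)


x0 = 2.0
single = toy_block(x0)                  # one-pass baseline
naive = naive_loop(toy_block, x0, 4)    # four full-size updates
damped = damped_loop(toy_block, x0, 4)  # four quarter-size updates

print(f"single={single:.4f} naive={naive:.4f} damped={damped:.4f}")
```

In this toy case the damped result stays closer to the single-pass output than the naive result does, which is the behavior the damping is meant to provide for the layers that follow the loop window.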

If this is right

Accuracy rises on MMLU-Pro by 2.64 points for Qwen3-4B-Instruct, on CommonsenseQA by 1.14 points for Qwen3-30B-A3B-Instruct, and on OpenBookQA by 1.20 points for Moonlight-16B-A3B-Instruct.
The same wrapper improves results on dense, sparse MoE, and MLA+MoE architectures without any training.
Naive block reapplication degrades performance, confirming that the damping strategy is required for the observed benefit.
No continued training, fine-tuning, or architectural changes are needed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The ODE framing may suggest similar inference-time step-size refinements for other sequence architectures that admit an Euler-like interpretation.
Variable damping schedules or position-dependent loop depths could be explored as direct extensions of the same numerical view.
If the method scales with model size, it offers a low-cost route to squeeze additional performance from already-trained checkpoints.

Load-bearing premise

A pre-norm transformer block can be meaningfully viewed as a forward Euler step on an ODE, so that replacing one large update with multiple damped sub-steps constitutes a refinement rather than an arbitrary modification.

What would settle it

Applying the damped looped updates to a held-out model family and finding no accuracy gain or a net loss relative to the single-pass baseline on standard benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.23872 by Chen Liang, Jonathan Li, Lizhang Chen, Ni Lao, Qiang Liu.

**Figure 1.** Training-free looped transformer wrapper, two iteration modes. A frozen checkpoint is augmented at inference by re-applying a contiguous mid-block g = L_b ∘ ⋯ ∘ L_a for K iterations before resuming the post-loop layers. No weights are changed and no new parameters are introduced; the cost is (b − a + 1)(K − 1) extra forward passes through the loop window. (a) Block-mode iterates the whole window K times,…

**Figure 2.** RK integration vs. naive looping. A tiny MLP, pre (R^4 → R^2) → block of 3 residual layers (R^2 → R^2) → post (R^2 → R^2), is trained end-to-end on a 2-D regression target. Each panel fixes K ∈ {2, 4, 8} and shows, over a 220×220 grid in the post-block hidden state z, the median test loss L(z) = med_i ‖post(z) − y_i‖^2 (log scale, colored) together with the test-set scatters obtained by naive looping (red cir…

**Figure 3.** Per-benchmark accuracy across three Qwen3 model variants under the training-free loop wrapper. Each panel shows baseline (striped gray) vs. our wrapper (solid blue) on 4 knowledge-MC benchmarks. The per-panel y-axis is cropped to emphasize the baseline-to-loop gap. Left: Qwen3-4B-Instruct (dense MHA) on four mid-range general benchmarks. Middle: Qwen3-4B-Base on four hard MMLU 5-shot subjects, selected fro…

**Figure 4.** Effect of loop count K on 16-task macro-average accuracy. (a) Ours (Algorithm 5) is stable across K ∈ {1, 2, ⋯, 24}, while uniform loop [57] (…

**Figure 5.** Comparison with other looped transformer methods on Llama-3.2-3B-Instruct and Moonlight-16B-A3B-Instruct. On each backbone we report three knowledge-MC benchmarks under three configurations: baseline (no loop, original checkpoint), naive loop with K=4, and ours (Algorithm 3).

3.7 Comparison with other looped transformer methods. We further isolate the contribution of our method by comparing it against two n…

**Figure 6.** The depth-fraction rule across nine architectures. For each checkpoint we plot the best loop-window range [a/N, b/N] as a horizontal bar with the window’s center (a+b)/(2N) as a filled circle. The shaded band marks the 0.45–0.60 depth fraction, which contains 7 of 9 checkpoints’ optimal window centers: Qwen3 dense (0.6B–4B), Qwen3-30B-A3B MoE, DeepSeek-V2-Lite, Moonlight-16B-A3B, and Llama-3.2-3B, with only t…
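
The two formulas quoted in the captions, the loop cost (b − a + 1)(K − 1) from Figure 1 and the window center (a + b)/(2N) from Figure 6, can be sketched as small helpers. The function names and the example configuration (a 36-layer model looping layers 16 to 19 with K = 4) are hypothetical and are not taken from the paper.

```python
def extra_layer_passes(a: int, b: int, K: int) -> int:
    """Extra layer applications from looping layers a..b (inclusive) K times."""
    if b < a or K < 1:
        raise ValueError("need a <= b and K >= 1")
    return (b - a + 1) * (K - 1)


def window_depth_fractions(a: int, b: int, N: int) -> tuple:
    """Return (a/N, b/N, center), where center = (a + b) / (2N)."""
    if N <= 0:
        raise ValueError("N must be positive")
    return a / N, b / N, (a + b) / (2 * N)


def center_in_band(a: int, b: int, N: int, lo: float = 0.45, hi: float = 0.60) -> bool:
    """Check whether the window center falls inside the [lo, hi] depth band."""
    _, _, center = window_depth_fractions(a, b, N)
    return lo <= center <= hi


# Hypothetical configuration for illustration only.
a, b, N, K = 16, 19, 36, 4
print("extra layer passes:", extra_layer_passes(a, b, K))
print("window fractions:", window_depth_fractions(a, b, N))
print("center in 0.45-0.60 band:", center_in_band(a, b, N))
```

With K = 1 the extra cost is zero, which matches the single-pass baseline.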

Original abstract

We introduce training-free looped transformers, in which a lightweight inference-time wrapper loops a contiguous mid-stack block of layers of a frozen checkpoint without additional fine-tuning, continued training, or architectural changes. Unlike prior looped transformer methods that train with the looped structure end-to-end, we retrofit recurrence onto pretrained models at test time. We show that naive block reapplication usually degrades performance, highlighting the importance of the loop application strategy. Motivated by viewing a pre-norm transformer block as a forward Euler step on an ODE, we instead treat looping as a refinement of the same approximation, replacing one large update with smaller damped sub-steps. Across seven dense, sparse MoE, and MLA+MoE model families, our method improves Qwen3-4B-Instruct by +2.64 pp on MMLU-Pro, Qwen3-30B-A3B-Instruct by +1.14 pp on CommonsenseQA, and Moonlight-16B-A3B-Instruct by +1.20 pp on OpenBookQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a practical training-free looping trick that lifts accuracy on several frozen checkpoints, but the ODE framing is unverified and the gains lack controls.

The core takeaway is that applying a damped loop to a mid-stack block at inference time improves results on Qwen and Moonlight models without any retraining. That part is new relative to earlier looped-transformer work that trains the recurrence from scratch. The method is simple to implement on existing checkpoints, and the authors test it across dense, MoE, and MLA variants, which is useful for people who already have good base models and want a quick boost on MMLU-Pro or CommonsenseQA. The reported deltas (+2.64 pp, +1.14 pp, +1.20 pp) are large enough to notice in practice. The paper also correctly notes that naive reapplication hurts, so the damping choice matters. That observation is worth keeping.

The soft spot is the justification. The abstract motivates the damping by treating a pre-norm block as a forward-Euler step on an ODE, yet supplies no derivation, no Lipschitz check, and no truncation-error argument showing why the damped sub-steps actually refine the approximation rather than just adding a different schedule. Without that link, the performance numbers could come from extra depth, implicit regularization, or any other non-ODE effect. The stress-test concern therefore stands: the central claim rests on an unverified continuous limit. No error bars or ablation tables are described either, so it is hard to judge stability.

The work is still worth referee time because the empirical pattern is reproducible on public checkpoints and training-free operation is a real constraint in deployment. Readers who care about inference tricks for already-trained LLMs will get value; theorists looking for a clean ODE story will not. I would send it out for review with the expectation that the authors need to either drop the ODE language or supply the missing analysis.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces training-free looped transformers: a lightweight inference-time wrapper that reapplies a contiguous mid-stack block of layers from a frozen pretrained checkpoint (dense, sparse MoE, or MLA+MoE) without fine-tuning or architectural modification. Naive reapplication is shown to degrade performance; the authors instead damp the looped updates, motivated by the claim that a pre-norm transformer block corresponds to a forward Euler step on an underlying ODE so that multiple damped sub-steps refine the same discretization. Empirical gains are reported across seven model families, including +2.64 pp on MMLU-Pro for Qwen3-4B-Instruct, +1.14 pp on CommonsenseQA for Qwen3-30B-A3B-Instruct, and +1.20 pp on OpenBookQA for Moonlight-16B-A3B-Instruct.

Significance. If the empirical gains prove robust and reproducible, the method would offer a practical, zero-training route to improve accuracy of existing checkpoints at inference time across multiple architectures. The absence of any parameter-free derivation, machine-checked analysis, or falsifiable prediction tied to the ODE limit, however, keeps the conceptual contribution modest even if the numbers hold.

major comments (2)

[Abstract / Motivation] Abstract and motivation section: the claim that a pre-norm transformer block constitutes a forward Euler discretization of an ODE (so that damping turns naive looping into a refinement) is asserted without derivation of the continuous limit, without verification that the block satisfies consistency or Lipschitz conditions required for Euler convergence, and without analysis showing that the damping factor reduces local truncation error. This justification is load-bearing for the central claim that the reported gains arise from improved approximation quality rather than from effective depth, implicit regularization, or an empirical schedule.
[Experiments] Experimental results (reported gains): the improvements (+2.64 pp, +1.14 pp, +1.20 pp) are presented without error bars, without controls that isolate the damping schedule from other looping variants, and without implementation details on how the damping factor is chosen or applied inside the block. Absent these, it is impossible to attribute success specifically to the ODE-refinement mechanism.

minor comments (1)

[Abstract] The abstract states that seven model families were tested but lists only three concrete models; a table or explicit list of all families and checkpoints would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our work. We respond to each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: [Abstract / Motivation] Abstract and motivation section: the claim that a pre-norm transformer block constitutes a forward Euler discretization of an ODE (so that damping turns naive looping into a refinement) is asserted without derivation of the continuous limit, without verification that the block satisfies consistency or Lipschitz conditions required for Euler convergence, and without analysis showing that the damping factor reduces local truncation error. This justification is load-bearing for the central claim that the reported gains arise from improved approximation quality rather than from effective depth, implicit regularization, or an empirical schedule.

Authors: We acknowledge that the manuscript does not provide a full derivation of the continuous limit or the required conditions for convergence. The ODE view is offered as motivation for introducing damping, drawing on the residual structure of pre-norm blocks. We will revise the abstract and motivation section to clarify that this is an intuitive analogy inspired by neural ODEs, without claiming a rigorous discretization analysis. The primary contribution remains the empirical demonstration that damped looping improves performance over naive looping across multiple models.
Revision: partial

Referee: [Experiments] Experimental results (reported gains): the improvements (+2.64 pp, +1.14 pp, +1.20 pp) are presented without error bars, without controls that isolate the damping schedule from other looping variants, and without implementation details on how the damping factor is chosen or applied inside the block. Absent these, it is impossible to attribute success specifically to the ODE-refinement mechanism.

Authors: We agree that additional experimental details and controls are necessary to support the claims. In the revised manuscript, we will report error bars based on multiple evaluation runs, include ablation experiments that compare the proposed damped approach against undamped looping and alternative schedules, and provide precise implementation details on the selection and application of the damping factor. These changes will allow better attribution of the gains to the damping mechanism.
Revision: yes
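
A minimal sketch of the kind of error bar the response describes, assuming that scores from repeated evaluation runs are summarized by their mean and standard error. The helper name `mean_and_stderr` and the run values are hypothetical placeholders, not results from the paper.

```python
import math
import statistics


def mean_and_stderr(values):
    """Return (mean, standard error of the mean) for a list of run scores."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.mean(values)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr


# Hypothetical per-run accuracy gains in percentage points (placeholders only).
gains_pp = [1.0, 2.0, 3.0]
mean, stderr = mean_and_stderr(gains_pp)
print(f"gain = {mean:.2f} +/- {stderr:.2f} pp over {len(gains_pp)} runs")
```

A gain whose interval excludes zero across runs would be easier to attribute to the method than a single-run delta.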

Circularity Check

0 steps flagged

No significant circularity; empirical gains are externally measured

The paper motivates looped application via an ODE forward-Euler analogy but does not derive performance improvements mathematically from that view. Instead, it applies a test-time wrapper to frozen checkpoints and reports accuracy deltas on external benchmarks (MMLU-Pro, CommonsenseQA, OpenBookQA) across multiple model families. No parameters are fitted inside the method and then renamed as predictions, no self-citations supply the load-bearing justification, and no equation reduces the claimed refinement to a definitional identity. The results remain falsifiable by independent evaluation, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the ODE analogy is invoked heuristically without stated assumptions or fitted constants.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J uniqueness) · echoes

Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

A standard pre-norm transformer layer L implements L(x) = x + Attn(LN1(x)) + MLP(LN2(x + Attn(LN1(x)))). ... we define the window residual field F_g(x) := g(x) − x. By construction, g(x) = x + F_g(x), which is exactly a forward Euler step with step size h = 1 on the autonomous ODE ẋ = F_g(x).

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Unclear: relation between the paper passage and the cited Recognition theorem.

Naively looping g for K rounds ... is a K-step forward Euler integration ... which approximates x(t=K). But the post-loop layers are not trained to receive the trajectory at t=K ... the principled goal of g^(K) is therefore not to advance integration to t=K, but to better approximate the same endpoint x(t=1).
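
For intuition, the endpoint argument in the passage above can be checked on a scalar linear field F(x) = λx, where the ODE has the exact solution x(t) = e^(λt) x(0). The sketch below is an illustration under that assumption: the linear field, the value of λ, and the uniform sub-step size 1/K are chosen here and are not taken from the paper. K naive loops give (1 + λ)^K x(0), a forward Euler estimate of x(K), while K sub-steps of size 1/K give (1 + λ/K)^K x(0), which approaches x(1) as K grows.

```python
import math


def naive_endpoint(lam: float, K: int, x0: float = 1.0) -> float:
    """K forward Euler steps of size 1: integrates out to t = K."""
    return (1.0 + lam) ** K * x0


def substep_endpoint(lam: float, K: int, x0: float = 1.0) -> float:
    """K forward Euler steps of size 1/K: integrates out to t = 1."""
    return (1.0 + lam / K) ** K * x0


lam = -0.5                # hypothetical rate, chosen for illustration
exact_t1 = math.exp(lam)  # exact x(1) for x(0) = 1

for K in (1, 2, 4, 8):
    err_naive = abs(naive_endpoint(lam, K) - exact_t1)
    err_sub = abs(substep_endpoint(lam, K) - exact_t1)
    print(f"K={K}: naive error {err_naive:.4f}, sub-step error {err_sub:.4f}")
```

In this example the sub-step error against x(1) shrinks as K increases, while the naive error grows because the iterate is heading toward x(K) instead.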

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · 19 internal anchors

[1]

Transformers learn to implement preconditioned gradient descent for in-context learning

Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023

work page 2023
[2]

Alexander C. Aitken. On Bernoulli’s numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289–305, 1927

work page 1927
[3]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Relaxed recursive transformers: Effective parameter sharing with layer-wise lora

Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025

work page 2025
[5]

Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation

Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron C. Courville, and Se-Young Yun. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. In Advances in Neural Information Processing Systems 39: Annual Conference on Neural Information Processing...

work page 2025
[6]

Deep equilibrium models

Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pages 688–699, 2019

work page 2019
[7]

Multiscale deep equilibrium models

Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Multiscale deep equilibrium models. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 2020

work page 2020
[8]

Pondernet: Learning to ponder

Andrea Banino, Jan Balaguer, and Charles Blundell. Pondernet: Learning to ponder. CoRR, abs/2107.05407, 2021

work page arXiv 2021
[9]

End-to-end algorithm synthesis with recurrent networks: Extrapolation without overthinking

Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum, and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Extrapolation without overthinking. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022

work page 2022
[10]

Eliciting Latent Predictions from Transformers with the Tuned Lens

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. CoRR, abs/2303.08112, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Numerical Methods for Ordinary Differential Equations

John C. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, 3rd edition, 2016

work page 2016
[12]

Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent

Bo Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, and Zhao Song. Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent. In International Conference on Artificial Intelligence and Statistics, AISTATS 2025, Proceedings of Machine Learning Research, pages 4447–4455, 2025

work page 2025
[13]

Loop as a bridge: Can looped transformers truly link representation space and natural language outputs?

Guanxu Chen, Dongrui Liu, and Jing Shao. Loop as a bridge: Can looped transformers truly link representation space and natural language outputs? CoRR, abs/2601.10242, 2026

work page arXiv 2026
[14]

Thinking deeper, not longer: Depth-recurrent transformers for compositional generalization

Hung-Hsuan Chen. Thinking deeper, not longer: Depth-recurrent transformers for compositional generalization. CoRR, abs/2603.21676, 2026

work page arXiv 2026
[15]

Demystifying LION: a Hamiltonian perspective

Lizhang Chen. Demystifying LION: a Hamiltonian perspective. Master’s thesis, The University of Texas at Austin, 2025

work page 2025
[16]

Cautious weight decay

Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, and Qiang Liu. Cautious weight decay. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026

work page 2026
[17]

Muon optimizes under spectral norm constraints

Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. Trans. Mach. Learn. Res., 2026, 2026

work page 2026
[18]

ϕ-balancing for mixture-of-experts training

Lizhang Chen, Jonathan Li, Qi Wang, Runlong Liao, Shuozhe Li, Chen Liang, Ni Lao, and Qiang Liu. ϕ-balancing for mixture-of-experts training. In Forty-third International Conference on Machine Learning, ICML 2026, 2026

work page 2026
[19]

Lion secretly solves a constrained optimization: As Lyapunov predicts

Lizhang Chen, Bo Liu, Kaizhao Liang, and Qiang Liu. Lion secretly solves a constrained optimization: As Lyapunov predicts. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024

work page 2024
[20]

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations. CoRR, abs/2412.13171, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[22]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[23]

Moeut: Mixture-of-experts universal transformers

Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, and Christopher D. Manning. Moeut: Mixture-of-experts universal transformers. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, 2024

work page 2024
[24]

Simulation of graph algorithms with looped transformers

Artur Back de Luca and Kimon Fountoulakis. Simulation of graph algorithms with looped transformers. In Forty-first International Conference on Machine Learning, ICML 2024, Proceedings of Machine Learning Research, pages 2319–2363, 2024

work page 2024
[25]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. CoRR, abs/2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Universal transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In 7th International Conference on Learning Representations, ICLR 2019, 2019

work page 2019
[28]

Looped transformers for length generalization

Ying Fan, Yilun Du, Kannan Ramchandran, and Kangwook Lee. Looped transformers for length generalization. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025

work page 2025
[29]

Towards revealing the mystery behind chain of thought: A theoretical perspective

Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023

work page 2023
[30]

A framework for few-shot language model evaluation

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

work page 2023
[31]

Algoformer: An efficient transformer framework with algorithmic structures

Yihang Gao, Chuanyang Zheng, Enze Xie, Han Shi, Tianyang Hu, Yu Li, Michael Ng, Zhenguo Li, and Zhaoqiang Liu. Algoformer: An efficient transformer framework with algorithmic structures. Trans. Mach. Learn. Res., 2025, 2025

work page 2025
[32]

Can looped transformers learn to implement multi-step gradient descent for in-context learning?

Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, and Sanjiv Kumar. Can looped transformers learn to implement multi-step gradient descent for in-context learning? In Forty-first International Conference on Machine Learning, ICML 2024, Proceedings of Machine Learning Research, pages 15130–15152, 2024

work page 2024
[33]

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. CoRR, abs/2502.05171, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Efficient parallel samplers for recurrent-depth models and their connection to diffusion language models

Jonas Geiping, Xinyu Yang, and Guinan Su. Efficient parallel samplers for recurrent-depth models and their connection to diffusion language models. CoRR, abs/2510.14961, 2025

work page arXiv 2025
[35]

Looped transformers as programmable computers

Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, ICML 2023, Proceedings of Machine Learning Research, pages 11398–11442, 2023

work page 2023
[36]

What makes looped transformers perform better than non-recursive ones (provably)

Zixuan Gong, Jiaye Teng, and Yong Liu. What makes looped transformers perform better than non-recursive ones (provably). CoRR, abs/2510.10089, 2025

work page arXiv 2025
[37]

Think before you speak: Training language models with pause tokens

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024

work page 2024
[38]

Adaptive Computation Time for Recurrent Neural Networks

Alex Graves. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[39]

Training Large Language Models to Reason in a Continuous Latent Space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. CoRR, abs/2412.06769, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, 2021

work page 2021
[41]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Proces...

work page 2023
[42]

Loopformer: Elastic-depth looped transformers for latent reasoning via shortcut modulation

Ahmadreza Jeddi, Marco Ciccone, and Babak Taati. Loopformer: Elastic-depth looped transformers for latent reasoning via shortcut modulation. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026

work page 2026
[43]

Depth-recurrent attention mixtures: Giving latent reasoning the attention it deserves

Jonas Knupp, Jan Hendrik Metzen, Jeremias Bohn, Georg Groh, and Kristian Kersting. Depth-recurrent attention mixtures: Giving latent reasoning the attention it deserves. CoRR, abs/2601.21582, 2026

work page arXiv 2026
[44]

Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, and Yuekun Yao. Loop, think, & generalize: Implicit reasoning in recurrent-depth transformers. CoRR, abs/2604.07822, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[45]

Encode, think, decode: Scaling test-time reasoning with recursive latent thoughts

Yeskendir Koishekenov, Aldo Lipani, and Nicola Cancedda. Encode, think, decode: Scaling test-time reasoning with recursive latent thoughts. CoRR, abs/2510.07358, 2025

work page arXiv 2025
[46]

The remarkable robustness of llms: Stages of inference?

Vedang Lad, Wes Gurnee, and Max Tegmark. The remarkable robustness of llms: Stages of inference? CoRR, abs/2406.19384, 2024

work page arXiv 2024
[47] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations, ICLR 2020, 2020.

[48] Shuozhe Li, Vaishnav Tadiparthi, Kwonjoon Lee, Nakul Agarwal, Hossein Nourkhiz Mahjoub, Ehsan Moradi-Pari, Lizhang Chen, Amy Zhang, and Liu Leqi. Learning robust reasoning through guided adversarial self-play. CoRR, abs/2602.00173, 2026.

[49] Jia Liang and Liangming Pan. Do latent-cot models think step-by-step? A mechanistic study on sequential reasoning tasks. CoRR, abs/2602.00449, 2026.

[50] Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026.

[51] Kaizhao Liang, Bo Liu, Lizhang Chen, and Qiang Liu. Memory-efficient LLM training with online subspace descent. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, 2024.

[52] Runlong Liao, Jian Yu, Baiyu Su, Chi Zhang, Lizhang Chen, and Qiang Liu. Momentum guidance: Plug-and-play guidance for flow models. CoRR, abs/2602.20360, 2026.

[53] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, pages 3214–3252, 2022.

[54] Bo Liu, Lemeng Wu, Lizhang Chen, Kaizhao Liang, Jiaxu Zhu, Chen Liang, Raghuraman Krishnamoorthi, and Qiang Liu. Communication efficient distributed training with distributed Lion. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, 2024.

[55] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is sca...

[56] Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, and Enqi Liu. Latent chain-of-thought? Decoding the depth-recurrent transformer. CoRR, abs/2507.02199, 2025.

[57] Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, and Ghouthi Boukli Hacene. Inner loop inference for pretrained transformers: Unlocking latent capabilities without training. CoRR, abs/2602.14759, 2026.

[58] Arvind V. Mahankali, Tatsunori Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024.

[59] Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. ShortGPT: Layers in large language models are more redundant than you expect. In Findings of the Association for Computational Linguistics, ACL 2025, Findings of ACL, pages 20192–20204, 2025.

[60] William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024.

[61] William Merrill and Ashish Sabharwal. A little depth goes a long way: The expressive power of log-depth transformers. CoRR, abs/2503.03961, 2025.

[62] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018.

[63] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. CoTFormer: A chain of thought driven architecture with budget-adaptive computation cost at inference. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025.

[64] Son Nguyen, Lizhang Chen, Bo Liu, and Qiang Liu. Memory-efficient optimization with factorized Hamiltonian descent. In International Conference on Artificial Intelligence and Statistics, AISTATS 2025, Proceedings of Machine Learning Research, pages 2863–2871, 2025.

[65] Son Nguyen, Bo Liu, Lizhang Chen, and Qiang Liu. Improving adaptive moment optimization via preconditioner diagonalization. In International Conference on Artificial Intelligence and Statistics, AISTATS 2026, 2026.

[66] Mohammadmahdi Nouriborji, Morteza Rohanian, and Omid Rohanian. Improving recursive transformers with mixture of LoRAs. CoRR, abs/2512.12880, 2025.

[67] Costin-Andrei Oncescu, Depen Morwani, Samy Jelassi, Alexandru Meterez, Mujin Kwun, and Sham M. Kakade. The recurrent transformer: Greater effective depth and efficient decoding. CoRR, abs/2604.21215, 2026.

[68] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, CHIL 2022, Proceedings of Machine Learning Research, pages 248–260, 2022.

[69] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 1: Long Papers, 2016.

[70] Francesco Pappone, Donato Crisostomi, and Emanuele Rodolà. Two-scale latent dynamics for recurrent-depth transformers. CoRR, abs/2509.23314, 2025.

[71] Bowen Peng, Lizhang Chen, Baiyu Su, Jeffrey Quesnelle, Diederik P. Kingma, and Qiang Liu. DeMo: Decoupled momentum optimization. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026.

[72] Jacob Pfau, William Merrill, and Samuel R. Bowman. Let's think dot by dot: Hidden computation in transformer language models. CoRR, abs/2404.15758, 2024.

[73] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[74] Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y. Fu. Parcae: Scaling laws for stable looped language models. CoRR, abs/2604.12946, 2026.

[75] Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. Subformer: Exploring weight sharing for parameter efficiency in generative transformers. In Findings of the Association for Computational Linguistics: EMNLP 2021, Findings of ACL, pages 4081–4090, 2021.

[76] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.

[77] Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025.

[78] Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, and Tom Goldstein. Can you learn an algorithm? Generalizing from easy to hard problems with recurrent networks. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, pages 6695–6706, 2021.

[79] Baiyu Su, Lizhang Chen, and Qiang Liu. The curious case of AdamW, 2026.

[80] Sho Takase and Shun Kiyono. Lessons on parameter sharing across layers in transformers. In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing, SustaiNLP 2023, pages 78–90, 2023.


[1] Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023.

[2] Alexander C. Aitken. On Bernoulli's numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289–305, 1927.

[3] Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021.

[4] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed recursive transformers: Effective parameter sharing with layer-wise LoRA. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025.

[5] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron C. Courville, and Se-Young Yun. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. In Advances in Neural Information Processing Systems 39: Annual Conference on Neural Information Processing...

[6] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pages 688–699, 2019.

[7] Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Multiscale deep equilibrium models. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 2020.

[8] Andrea Banino, Jan Balaguer, and Charles Blundell. PonderNet: Learning to ponder. CoRR, abs/2107.05407, 2021.

[9] Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum, and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Extrapolation without overthinking. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022.

[10] Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. CoRR, abs/2303.08112, 2023.

[11] John C. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, 3rd edition, 2016.

[12] Bo Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, and Zhao Song. Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent. In International Conference on Artificial Intelligence and Statistics, AISTATS 2025, Proceedings of Machine Learning Research, pages 4447–4455, 2025.

[13] Guanxu Chen, Dongrui Liu, and Jing Shao. Loop as a bridge: Can looped transformers truly link representation space and natural language outputs? CoRR, abs/2601.10242, 2026.

[14] Hung-Hsuan Chen. Thinking deeper, not longer: Depth-recurrent transformers for compositional generalization. CoRR, abs/2603.21676, 2026.

[15] Lizhang Chen. Demystifying LION: a Hamiltonian perspective. Master's thesis, The University of Texas at Austin, 2025.

[16] Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, and Qiang Liu. Cautious weight decay. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026.

[17] Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. Trans. Mach. Learn. Res., 2026, 2026.

[18] Lizhang Chen, Jonathan Li, Qi Wang, Runlong Liao, Shuozhe Li, Chen Liang, Ni Lao, and Qiang Liu. ϕ-balancing for mixture-of-experts training. In Forty-third International Conference on Machine Learning, ICML 2026, 2026.

[19] Lizhang Chen, Bo Liu, Kaizhao Liang, and Qiang Liu. Lion secretly solves a constrained optimization: As Lyapunov predicts. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024.

[20] Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations. CoRR, abs/2412.13171, 2024.

[21] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018.

[22] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021.

[23] Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, and Christopher D. Manning. MoEUT: Mixture-of-experts universal transformers. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, 2024.

[24] Artur Back de Luca and Kimon Fountoulakis. Simulation of graph algorithms with looped transformers. In Forty-first International Conference on Machine Learning, ICML 2024, Proceedings of Machine Learning Research, pages 2319–2363, 2024.

[25] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024.

[26] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. CoRR, abs/2501.12948, 2025.

[27] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In 7th International Conference on Learning Representations, ICLR 2019, 2019.

[28] Ying Fan, Yilun Du, Kannan Ramchandran, and Kangwook Lee. Looped transformers for length generalization. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025.

[29] Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023.

[30] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

[31] Yihang Gao, Chuanyang Zheng, Enze Xie, Han Shi, Tianyang Hu, Yu Li, Michael Ng, Zhenguo Li, and Zhaoqiang Liu. Algoformer: An efficient transformer framework with algorithmic structures. Trans. Mach. Learn. Res., 2025, 2025.

[32] Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, and Sanjiv Kumar. Can looped transformers learn to implement multi-step gradient descent for in-context learning? In Forty-first International Conference on Machine Learning, ICML 2024, Proceedings of Machine Learning Research, pages 15130–15152, 2024.

[33] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. CoRR, abs/2502.05171, 2025.

[34] Jonas Geiping, Xinyu Yang, and Guinan Su. Efficient parallel samplers for recurrent-depth models and their connection to diffusion language models. CoRR, abs/2510.14961, 2025.

[35] Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, ICML 2023, Proceedings of Machine Learning Research, pages 11398–11442, 2023.

[36] Zixuan Gong, Jiaye Teng, and Yong Liu. What makes looped transformers perform better than non-recursive ones (provably). CoRR, abs/2510.10089, 2025.

[37] Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024.

[38] Alex Graves. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983, 2016.

[39] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. CoRR, abs/2412.06769, 2024.

[40] [40]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In9th International Conference on Learning Representations, ICLR 2021, 2021

work page 2021

[41] [41]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Proces...

work page 2023

[42] [42]

Loopformer: Elastic-depth looped transformers for latent reasoning via shortcut modulation

Ahmadreza Jeddi, Marco Ciccone, and Babak Taati. Loopformer: Elastic-depth looped transformers for latent reasoning via shortcut modulation. InThe Fourteenth International Conference on Learning Representations, ICLR 2026, 2026

work page 2026

[43] [43]

Depth-recurrent attention mixtures: Giving latent reasoning the attention it deserves.CoRR, abs/2601.21582, 2026

Jonas Knupp, Jan Hendrik Metzen, Jeremias Bohn, Georg Groh, and Kristian Kersting. Depth-recurrent attention mixtures: Giving latent reasoning the attention it deserves.CoRR, abs/2601.21582, 2026. 13

work page arXiv 2026

[44] [44]

Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, and Yuekun Yao. Loop, think, & generalize: Implicit reasoning in recurrent-depth transformers.CoRR, abs/2604.07822, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[45] [45]

Encode, think, decode: Scaling test-time reasoning with recursive latent thoughts.CoRR, abs/2510.07358, 2025

Yeskendir Koishekenov, Aldo Lipani, and Nicola Cancedda. Encode, think, decode: Scaling test-time reasoning with recursive latent thoughts.CoRR, abs/2510.07358, 2025

work page arXiv 2025

[46] [46]

arXiv preprint arXiv:2406.19384 , year=

Vedang Lad, Wes Gurnee, and Max Tegmark. The remarkable robustness of llms: Stages of inference?CoRR, abs/2406.19384, 2024

work page arXiv 2024

[47] [47]

ALBERT: A lite BERT for self-supervised learning of language representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations, ICLR 2020, 2020

work page 2020

[48] [48]

Learning robust reasoning through guided adversarial self-play.CoRR, abs/2602.00173, 2026

Shuozhe Li, Vaishnav Tadiparthi, Kwonjoon Lee, Nakul Agarwal, Hossein Nourkhiz Mahjoub, Ehsan Moradi-Pari, Lizhang Chen, Amy Zhang, and Liu Leqi. Learning robust reasoning through guided adversarial self-play.CoRR, abs/2602.00173, 2026

work page arXiv 2026

[49] [49]

Do latent-cot models think step-by-step? A mechanistic study on sequential reasoning tasks.CoRR, abs/2602.00449, 2026

Jia Liang and Liangming Pan. Do latent-cot models think step-by-step? A mechanistic study on sequential reasoning tasks.CoRR, abs/2602.00449, 2026

work page arXiv 2026

[50] [50]

Cautious optimizers: Improving training with one line of code

Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code. InThe Fourteenth International Conference on Learning Representations, ICLR 2026, 2026

work page 2026

[51] [51]

Memory-efficient LLM training with online subspace descent

Kaizhao Liang, Bo Liu, Lizhang Chen, and Qiang Liu. Memory-efficient LLM training with online subspace descent. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, 2024

work page 2024

[52] [52]

Momentum guidance: Plug-and-play guidance for flow models.CoRR, abs/2602.20360, 2026

Runlong Liao, Jian Yu, Baiyu Su, Chi Zhang, Lizhang Chen, and Qiang Liu. Momentum guidance: Plug-and-play guidance for flow models.CoRR, abs/2602.20360, 2026

work page arXiv 2026

[53] [53]

Truthfulqa: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), ACL 2022, pages 3214–3252, 2022

work page 2022

[54] [54]

Communication efficient distributed training with distributed Lion

Bo Liu, Lemeng Wu, Lizhang Chen, Kaizhao Liang, Jiaxu Zhu, Chen Liang, Raghuraman Krishnamoorthi, and Qiang Liu. Communication efficient distributed training with distributed Lion. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, 2024

work page 2024

[55] [55]

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is sca...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [56]

Latent chain-of-thought? decoding the depth-recurrent transformer.CoRR, abs/2507.02199, 2025

Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, and Enqi Liu. Latent chain-of-thought? decoding the depth-recurrent transformer.CoRR, abs/2507.02199, 2025

work page arXiv 2025

[57] [57]

Inner loop inference for pretrained transformers: Unlocking latent capabilities without training.CoRR, abs/2602.14759, 2026

Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, and Ghouthi Boukli Hacene. Inner loop inference for pretrained transformers: Unlocking latent capabilities without training.CoRR, abs/2602.14759, 2026

work page arXiv 2026

[58] [58]

Mahankali, Tatsunori Hashimoto, and Tengyu Ma

Arvind V . Mahankali, Tatsunori Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. InThe Twelfth International Conference on Learning Representations, ICLR 2024, 2024

work page 2024

[59] [59]

Shortgpt: Layers in large language models are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. InFindings of the Association for Computational Linguistics, ACL 2025, Findings of ACL, pages 20192–20204, 2025

work page 2025

[60] [60]

The expressive power of transformers with chain of thought

William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. InThe Twelfth International Conference on Learning Representations, ICLR 2024, 2024. 14

work page 2024

[61] [61]

A little depth goes a long way: The expressive power of log-depth transformers.CoRR, abs/2503.03961, 2025

William Merrill and Ashish Sabharwal. A little depth goes a long way: The expressive power of log-depth transformers.CoRR, abs/2503.03961, 2025

work page arXiv 2025

[62] [62]

Can a suit of armor conduct electricity? A new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018

work page 2018

[63] [63]

Cotformer: A chain of thought driven architecture with budget-adaptive computation cost at inference

Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain of thought driven architecture with budget-adaptive computation cost at inference. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, 2025

work page 2025

[64] [64]

Memory-efficient optimization with factorized Hamiltonian descent

Son Nguyen, Lizhang Chen, Bo Liu, and Qiang Liu. Memory-efficient optimization with factorized Hamiltonian descent. InInternational Conference on Artificial Intelligence and Statistics, AISTATS 2025, Proceedings of Machine Learning Research, pages 2863–2871, 2025

work page 2025

[65] [65]

Improving adaptive moment optimization via preconditioner diagonalization

Son Nguyen, Bo Liu, Lizhang Chen, and Qiang Liu. Improving adaptive moment optimization via preconditioner diagonalization. InInternational Conference on Artificial Intelligence and Statistics, AISTATS 2026, 2026

[66]

Mohammadmahdi Nouriborji, Morteza Rohanian, and Omid Rohanian. Improving recursive transformers with mixture of LoRAs. CoRR, abs/2512.12880, 2025

[67]

Costin-Andrei Oncescu, Depen Morwani, Samy Jelassi, Alexandru Meterez, Mujin Kwun, and Sham M. Kakade. The recurrent transformer: Greater effective depth and efficient decoding. CoRR, abs/2604.21215, 2026

[68]

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, CHIL 2022, Proceedings of Machine Learning Research, pages 248–260, 2022

[69]

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 1: Long Papers, 2016

[70]

Francesco Pappone, Donato Crisostomi, and Emanuele Rodolà. Two-scale latent dynamics for recurrent-depth transformers. CoRR, abs/2509.23314, 2025

[71]

Bowen Peng, Lizhang Chen, Baiyu Su, Jeffrey Quesnelle, Diederik P. Kingma, and Qiang Liu. DeMo: Decoupled momentum optimization. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026

[72]

Jacob Pfau, William Merrill, and Samuel R. Bowman. Let’s think dot by dot: Hidden computation in transformer language models. CoRR, abs/2404.15758, 2024

[73]

Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964

[74]

Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y. Fu. Parcae: Scaling laws for stable looped language models. CoRR, abs/2604.12946, 2026

[75]

Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. Subformer: Exploring weight sharing for parameter efficiency in generative transformers. In Findings of the Association for Computational Linguistics: EMNLP 2021, Findings of ACL, pages 4081–4090, 2021

[76]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024

[77]

Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025

[78]

Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, and Tom Goldstein. Can you learn an algorithm? Generalizing from easy to hard problems with recurrent networks. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, pages 6695–6706, 2021

[79]

Baiyu Su, Lizhang Chen, and Qiang Liu. The curious case of AdamW, 2026

[80]

Sho Takase and Shun Kiyono. Lessons on parameter sharing across layers in transformers. In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing, SustaiNLP 2023, pages 78–90, 2023
