Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers

Anej Svete; Kangwook Lee; Ying Fan

arxiv: 2606.31779 · v1 · pith:RHATADYYnew · submitted 2026-06-30 · 💻 cs.LG · cs.CL


Pith reviewed 2026-07-01 06:49 UTC · model grok-4.3

keywords: latent chain-of-thought, looped transformers, reasoning, language models, efficiency, parallel supervision

The pith

Looped transformers with parallel supervision on latent positions match explicit chain-of-thought performance at 3B scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a looped Transformer can close the performance gap between latent and explicit reasoning by processing multiple latent blocks in parallel across iterations while supervising each block directly with gold chain-of-thought tokens. This setup reuses the same weights to deepen computation without adding parameters and produces hidden states whose projections recover the original reasoning steps. A reader would care because it shows latent reasoning can scale to 3 billion parameters with substantial reductions in the time spent generating intermediate steps. The approach requires gold tokens during training but then operates without emitting them at inference time.

Core claim

A looped padded Transformer that runs K latent blocks in parallel for R iterations, trained with cross-entropy loss on each latent position's corresponding gold CoT-step token, achieves explicit-CoT-level accuracy at the 3B scale while reducing thought-phase latency by 2.5x-6.9x. Projecting the final latent states through the base language-model head recovers the gold steps and can surface alternative valid intermediates, indicating that the latent space remains interpretable and aligned with explicit reasoning. Ablations show that both the looped backbone and the parallel gold-token supervision are required for the result.
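The looped forward pass described above can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: `shared_block`, `looped_forward`, the scalar states, and the `weight` parameter are all hypothetical stand-ins for a real Transformer stack. What it shows is the structure of the claim: K padded latent slots are appended to the prompt, and one set of weights is applied R times so that depth grows without new parameters.

```python
import math

def shared_block(states, weight):
    """One pass of a toy shared-weight block (stand-in for a Transformer stack).

    Each position mixes its own value with the mean over all positions,
    so every latent slot is updated in parallel within a single call.
    """
    mean = sum(states) / len(states)
    return [math.tanh(weight * s + (1.0 - weight) * mean) for s in states]

def looped_forward(prompt_states, num_latents, num_loops, weight=0.5):
    """Append K padded latent slots, then reuse the same block for R iterations."""
    states = list(prompt_states) + [0.0] * num_latents   # K padding slots
    for _ in range(num_loops):                           # R iterations, same weight
        states = shared_block(states, weight)
    return states[len(prompt_states):]                   # post-loop latent states

# Hypothetical sizes: 3 prompt positions, K = 4 latent slots, R = 3 loops.
latents = looped_forward([0.2, -0.1, 0.7], num_latents=4, num_loops=3)
```

Because `num_loops` is an ordinary argument, the same weights can be run for a different R at call time; whether accuracy holds under such a change is an empirical question the abstract does not answer.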

What carries the argument

Looped padded Transformer with parallel cross-entropy supervision on gold CoT-step tokens for each of K latent blocks across R iterations, which reuses weights to increase effective depth while aligning hidden states to explicit reasoning steps.
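As a rough sketch of what "parallel cross-entropy supervision" could look like, the snippet below averages a per-position cross-entropy over K latent positions, each scored against its own gold CoT-step token. The function names, the one-token-per-position pairing, and the uniform average are assumptions made for illustration; the paper's exact loss may differ.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def parallel_latent_ce(latent_logits, gold_tokens):
    """Mean cross-entropy over K latent positions, each against its own gold token."""
    if len(latent_logits) != len(gold_tokens):
        raise ValueError("need exactly one gold token per latent position")
    losses = [-log_softmax(logits)[gold]
              for logits, gold in zip(latent_logits, gold_tokens)]
    return sum(losses) / len(losses)

# K = 2 latent positions over a hypothetical 3-token vocabulary.
loss = parallel_latent_ce([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]], gold_tokens=[0, 2])
```

All K terms depend only on the post-loop states, so they can be computed in one pass rather than step by step as in explicit decoding.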

If this is right

Latent CoT can reach explicit CoT performance at 3B parameters instead of lagging behind as seen in prior methods.
Thought-phase latency drops by 2.5x to 6.9x across settings that range from compact math expressions to natural-language reasoning.
The final latent representations remain decodable into explicit steps, preserving interpretability without extra training.
Both the recurrent-depth structure and the parallel gold-token loss are necessary; removing either collapses the gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If gold steps are available only for a subset of training data, the method might still transfer to new domains by mixing supervised and unsupervised loops.
The ability to surface alternative valid steps suggests the latent space could support search or verification procedures that explicit decoding cannot easily do.
Increasing R at inference time without retraining could provide a tunable accuracy-latency trade-off not available in standard Transformers.

Load-bearing premise

Gold chain-of-thought step tokens can be supplied at training time and this direct supervision is enough to make the looped latent states functionally equivalent to explicit token steps.

What would settle it

If the post-loop latent states, when passed through the base LM head, fail to recover the gold reasoning steps or produce valid alternative steps on a test set, or if end-task accuracy at 3B scale remains below explicit CoT.
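The readout test described above could be scored along the following lines. This is a minimal sketch under stated assumptions: the logits stand for post-loop latents already passed through the LM head, readout is greedy argmax, and `valid_alternatives` is a hypothetical hand-labelled map from position to acceptable non-gold tokens. None of these names come from the paper.

```python
def readout_tokens(latent_logits):
    """Greedy readout: the argmax token id at each latent position."""
    return [max(range(len(logits)), key=logits.__getitem__)
            for logits in latent_logits]

def recovery_rate(latent_logits, gold_tokens, valid_alternatives=None):
    """Fraction of positions decoding to the gold token or a listed alternative."""
    valid_alternatives = valid_alternatives or {}
    decoded = readout_tokens(latent_logits)
    hits = 0
    for pos, (tok, gold) in enumerate(zip(decoded, gold_tokens)):
        if tok == gold or tok in valid_alternatives.get(pos, set()):
            hits += 1
    return hits / len(gold_tokens)

# Hypothetical logits for 3 latent positions over a 3-token vocabulary.
logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.4, 0.9]]
strict = recovery_rate(logits, gold_tokens=[1, 0, 1])
lenient = recovery_rate(logits, gold_tokens=[1, 0, 1], valid_alternatives={2: {2}})
```

Reporting both the strict and the lenient rate keeps "recovers the gold step" separate from "surfaces a valid alternative", which are distinct claims in the abstract.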

Figures

Figures reproduced from arXiv: 2606.31779 by Anej Svete, Kangwook Lee, Ying Fan.

**Figure 2.** LOTUS architecture. (a) Looped forward: the looped LM …

**Figure 3.** LOTUS-aux supervision at loop iteration t for block t. The auxiliary decoder gϕ only replaces the base LM head supervision in …

**Figure 4.** Multi-path readout probability for the example in Appendix …

Original abstract

Language models typically reason via explicit chain-of-thought (CoT), generating intermediate steps token-by-token. Latent CoT offers an alternative: it performs multi-step reasoning in the model's hidden states, replacing decoded tokens with continuous representations for greater efficiency. However, existing latent CoT methods underperform explicit CoT beyond 1B parameters, and the gap widens with scale. Looped, or recurrent-depth, Transformers, which reuse their weights to increase computation depth without adding parameters, are a natural fit for latent reasoning. We therefore ask whether looped Transformers can bridge this gap. We answer affirmatively with a simple recipe: a looped padded Transformer that processes K latent blocks in parallel for R iterations, with a cross-entropy loss on each latent position's gold CoT-step token, similar to explicit CoT supervision. We instantiate it as LOTUS (Looped Transformers with parallel supervision on latents). LOTUS is, to our knowledge, the first latent-CoT method to bridge the gap to explicit CoT at the 3B scale, while cutting thought-phase latency by 2.5x-6.9x from compact math expressions to natural language. Projecting LOTUS's post-loop latents through the base LM head recovers the gold reasoning steps and even surfaces alternative valid intermediate steps, evidence that its latent space is interpretable and CoT-aligned. Ablations confirm that both the looped backbone and the parallel supervision on gold CoT tokens are essential.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LOTUS gets latent CoT to match explicit CoT at 3B scale via looped depth plus parallel gold-token supervision, but the abstract leaves the relative contribution of each unclear.

The main takeaway is that a looped padded Transformer with cross-entropy loss on gold CoT tokens at every latent position reaches explicit-CoT accuracy at 3B parameters while cutting thought-phase latency by 2.5x-6.9x. The paper also shows that the final latent states can be projected back through the LM head to recover the gold steps and sometimes valid alternatives.

What is new is the concrete recipe: K latent blocks processed in parallel across R iterations, with direct token-level supervision on the gold reasoning sequence. Prior latent-CoT work reportedly fell short beyond 1B, so closing the gap at this scale with a simple recurrent-depth change plus the supervision is the central result. The claim that both the looped backbone and the parallel supervision are required comes from ablations mentioned in the abstract.

The soft spot is that the abstract supplies no accuracy tables, baseline comparisons, error bars, or dataset sizes. Without those numbers it is difficult to judge how large the drops are when either component is removed. The stress-test concern therefore still applies on the evidence given: the gold-token supervision supplies explicit step signals to every block, which could be the main reason performance improves rather than any emergent multi-step computation inside the loop. The paper states the ablations confirm both pieces matter, but the magnitudes are not reported here.

This is for groups working on parameter-efficient reasoning or recurrent-depth architectures. A reader who wants a practical recipe to test at 3B scale will find the description clear enough to reproduce. The work deserves a serious referee because the claim is specific, the method is straightforward, and the efficiency angle is relevant even if the experiments need fuller reporting.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces LOTUS, a looped padded Transformer that processes K latent blocks in parallel over R iterations while applying cross-entropy supervision on gold CoT-step tokens at each latent position. It claims this architecture is the first latent-CoT method to match explicit CoT performance at the 3B scale, reduces thought-phase latency by 2.5x–6.9x, and yields interpretable post-loop latents that recover gold reasoning steps (and valid alternatives) when projected through the base LM head. Ablations are said to show that both the recurrent-depth backbone and the parallel gold-token supervision are required.

Significance. If the empirical claims hold under rigorous controls, the result would be significant: it would demonstrate that recurrent-depth Transformers plus direct latent supervision can close the long-standing performance gap between latent and explicit reasoning at practical scales, while delivering measurable latency gains and partial interpretability of the latent states.

major comments (3)

[Experiments] Experiments section (and associated tables/figures): the central claim that LOTUS bridges the gap at 3B scale requires explicit accuracy numbers, baseline comparisons (including prior latent-CoT methods), dataset sizes, number of runs, and error bars or statistical tests; the abstract states results and ablations but supplies none of these details, leaving the magnitude and reliability of the bridging effect unsupported.
[Ablations] Ablation study (presumably §4 or §5): the claim that the looped backbone is essential (rather than the parallel gold-token supervision being the dominant driver) must be supported by a controlled ablation that removes the gold CoT-token loss while retaining the looped structure; if accuracy then falls to the level of prior latent-CoT methods, the bridging result is attributable to the supervision regime, not the recurrent-depth innovation.
[Method] Method description (K and R): the architecture is parameterized by the number of latent blocks K and iterations R; the paper must report how these values were chosen, whether they are held constant across scales, and whether performance remains stable when they are varied, because the abstract presents them as free parameters without quantifying sensitivity.
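One way to lay out the controls requested in the major comments is a small experiment grid that crosses the two components (looped backbone, gold-token supervision) with a sweep over K and R. The sketch below is illustrative only: the function name, the dictionary keys, and the K and R values are hypothetical and are not taken from the paper.

```python
from itertools import product

def ablation_grid(k_values, r_values):
    """Cross the 2x2 component ablation with a sweep over K and R."""
    runs = []
    for looped, gold_loss, k in product([True, False], [True, False], k_values):
        # A non-looped backbone is a single pass, so R only varies when looped.
        for r in (r_values if looped else [1]):
            runs.append({"looped": looped, "gold_supervision": gold_loss,
                         "K": k, "R": r})
    return runs

# Hypothetical sweep values, chosen only to show the shape of the grid.
grid = ablation_grid(k_values=[4, 8], r_values=[2, 4])
isolating_runs = [run for run in grid
                  if run["looped"] and not run["gold_supervision"]]
```

The `isolating_runs` subset corresponds to the second major comment: the loop is kept and the gold CoT-token loss is removed, which is the cell needed to attribute the gain to one component or the other.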

minor comments (2)

[Abstract] The abstract states latency reductions of 2.5x–6.9x “from compact math expressions to natural language” without defining the measurement protocol (wall-clock time, token count, or hardware); this should be clarified in the main text and figures.
[Method] Notation for the parallel cross-entropy loss on latent positions is introduced only informally; an explicit equation would improve reproducibility.
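For concreteness, one plausible form of the explicit equation requested in the last comment is given below. The notation is an assumption made here, not the paper's: $h_k^{(R)}$ is the latent state of block $k$ after $R$ loop iterations, $y_k$ is its gold CoT-step token, and $W_{\mathrm{head}}$ is the base LM head. It also assumes one gold token per latent position and a uniform average over blocks.

```latex
\mathcal{L}(\theta)
  = -\frac{1}{K} \sum_{k=1}^{K} \log p_\theta\!\left(y_k \mid h_k^{(R)}\right),
\qquad
p_\theta\!\left(\cdot \mid h_k^{(R)}\right)
  = \operatorname{softmax}\!\left(W_{\mathrm{head}}\, h_k^{(R)}\right)
```

If the paper supervises several tokens per block or uses the auxiliary decoder mentioned in the Figure 3 caption, the sum and the readout map would change accordingly.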

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and rigor where needed.

Referee: [Experiments] Experiments section (and associated tables/figures): the central claim that LOTUS bridges the gap at 3B scale requires explicit accuracy numbers, baseline comparisons (including prior latent-CoT methods), dataset sizes, number of runs, and error bars or statistical tests; the abstract states results and ablations but supplies none of these details, leaving the magnitude and reliability of the bridging effect unsupported.

Authors: The full manuscript (Sections 4–5 and associated tables) already reports explicit accuracy numbers for LOTUS versus explicit CoT and prior latent-CoT baselines at the 3B scale, dataset sizes, multiple runs, and error bars. However, we agree the abstract would be strengthened by surfacing key quantitative results. We will revise the abstract to include representative accuracy figures, baseline comparisons, and a note on statistical reliability, while ensuring all tables explicitly list run counts and error bars. revision: yes
Referee: [Ablations] Ablation study (presumably §4 or §5): the claim that the looped backbone is essential (rather than the parallel gold-token supervision being the dominant driver) must be supported by a controlled ablation that removes the gold CoT-token loss while retaining the looped structure; if accuracy then falls to the level of prior latent-CoT methods, the bridging result is attributable to the supervision regime, not the recurrent-depth innovation.

Authors: The existing ablations demonstrate necessity of both the looped backbone and parallel supervision. We acknowledge that an explicit controlled ablation removing only the gold CoT-token loss (while retaining the looped structure) would more cleanly isolate the recurrent-depth contribution. We will add this experiment to the revised manuscript. revision: yes
Referee: [Method] Method description (K and R): the architecture is parameterized by the number of latent blocks K and iterations R; the paper must report how these values were chosen, whether they are held constant across scales, and whether performance remains stable when they are varied, because the abstract presents them as free parameters without quantifying sensitivity.

Authors: We will expand the Method section to describe the selection criteria for K and R, confirm they are held fixed across scales in the reported experiments, and add a sensitivity analysis quantifying performance stability under reasonable variations of these hyperparameters. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture validated by ablations

The paper describes LOTUS as a concrete training recipe (looped padded Transformer with parallel cross-entropy on gold CoT-step tokens) whose performance claims rest on scale-specific experiments and component ablations at 3B parameters. No equations are presented that define a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise reduces to a self-citation chain or imported uniqueness theorem. The method is self-contained through direct empirical comparison to explicit CoT baselines rather than any derivation that collapses to its inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Based on the abstract alone, the approach relies on standard transformer assumptions plus the design choice of looping and parallel supervision; no new physical entities or heavily fitted constants are introduced.

free parameters (2)

K (latent blocks)
Architectural hyperparameter controlling parallelism of latent positions.
R (iterations)
Number of weight-reuse loops controlling effective depth.

axioms (1)

Domain assumption: Looped (recurrent-depth) Transformers increase effective computation depth without adding parameters.
Invoked as the backbone enabling latent multi-step reasoning.

Reference graph

Works this paper leans on

144 extracted references · 62 canonical work pages · 21 internal anchors

[2]

2026 , url=

Ayhan Suleymanzade and Halil Alperen Gozeten and Ismail Ilkan Ceylan and Jinwoo Kim , booktitle=. 2026 , url=

2026
[3]

KaVa: Latent Reasoning via Compressed

Anna Kuzina and Maciej Pi. KaVa: Latent Reasoning via Compressed. The Fourteenth International Conference on Learning Representations , year=
[4]

LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation , url =

Ahmadreza Jeddi and Marco Ciccone and Babak Taati , booktitle =. LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation , url =
[6]

Lightweight Latent Reasoning for Narrative Tasks

Lightweight Latent Reasoning for Narrative Tasks , url =. arXiv , author =:2512.02240 , journal =

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Training Neural Networks as Recognizers of Formal Languages , url =

Alexandra Butoi and Ghazal Khalighinejad and Anej Svete and Josef Valvoda and Ryan Cotterell and Brian DuSell , booktitle =. Training Neural Networks as Recognizers of Formal Languages , url =
[8]

Language Modeling with Learned Meta-Tokens , url =

Alok Shah and Khush Gupta and Keshav Ramji and Pratik Chaudhari , booktitle =. Language Modeling with Learned Meta-Tokens , url =
[9]

Probability Distributions Computed by Autoregressive Transformers , url =

Andy Yang and Anej Svete and Jiaoda Li and Anthony Widjaja Lin and Jonathan Rawski and Ryan Cotterell and David Chiang , booktitle =. Probability Distributions Computed by Autoregressive Transformers , url =
[10]

The Transformer Cookbook , url =

Andy Yang and Christopher Watson and Anton Xue and Satwik Bhattamishra and Jose Llarena and William Merrill and Emile Dos Santos Ferreira and Anej Svete and David Chiang , issn =. The Transformer Cookbook , url =. Transactions on Machine Learning Research , note =
[11]

On the Reasoning Abilities of Masked Diffusion Language Models , url =

Anej Svete and Ashish Sabharwal , booktitle =. On the Reasoning Abilities of Masked Diffusion Language Models , url =
[13]

Limits of Continuous Chain-of-Thought in Multi-Step and Multi-Chain Reasoning , url =

Ayhan Suleymanzade and Andreas Bergmeister and Stefanie Jegelka , booktitle =. Limits of Continuous Chain-of-Thought in Multi-Step and Multi-Chain Reasoning , url =
[15]

Boyi Zeng and Shixiang Song and Siyuan Huang and Yixuan Wang and He Li and Ziwei He and Xinbing Wang and Zhiyu li and Zhouhan Lin , booktitle =. Ponder
[16]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters , url =. arXiv , author =:2408.03314 , journal =

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning , url =

DiJia Su and Hanlin Zhu and Yingchen Xu and Jiantao Jiao and Yuandong Tian and Qinqing Zheng , booktitle =. Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning , url =
[21]

and Papailiopoulos, Dimitris , booktitle =

Giannou, Angeliki and Rajput, Shashank and Sohn, Jy-Yong and Lee, Kangwook and Lee, Jason D. and Papailiopoulos, Dimitris , booktitle =. Looped Transformers as Programmable Computers , url =
[22]

Neural Networks and the Chomsky Hierarchy , url =

Gregoire Deletang and Anian Ruoss and Jordi Grau-Moya and Tim Genewein and Li Kevin Wenliang and Elliot Catt and Chris Cundy and Marcus Hutter and Shane Legg and Joel Veness and Pedro A Ortega , booktitle =. Neural Networks and the Chomsky Hierarchy , url =
[23]

arXiv , author =:2502.12214 , journal =

Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement , url =. arXiv , author =:2502.12214 , journal =

work page arXiv
[25]

Continuous Chain of Thought Enables Parallel Exploration and Reasoning , url =

Halil Alperen Gozeten and Muhammed Emrullah Ildiz and Xuechen Zhang and Hrayr Harutyunyan and Ankit Singh Rawat and Samet Oymak , booktitle =. Continuous Chain of Thought Enables Parallel Exploration and Reasoning , url =
[26]

Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought , url =

Hanlin Zhu and Shibo Hao and Zhiting Hu and Jiantao Jiao and Stuart Russell and Yuandong Tian , booktitle =. Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought , url =
[27]

LaDiR: Latent Diffusion Enhances

Haoqiang Kang and Yizhe Zhang and Nikki Lijing Kuang and Nicklas Majamaki and Navdeep Jaitly and Yian Ma and Lianhui Qin , booktitle =. LaDiR: Latent Diffusion Enhances
[32]

Let's Verify Step by Step

Let's Verify Step by Step , url =. arXiv , author =:2305.20050 , journal =

work page internal anchor Pith review Pith/arXiv arXiv
[35]

arXiv , author =:2604.02051 , journal =

Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation , url =. arXiv , author =:2604.02051 , journal =

work page arXiv
[36]

Bowman , booktitle =

Jacob Pfau and William Merrill and Samuel R. Bowman , booktitle =. Let
[38]

Unique Hard Attention: A Tale of Two Sides , url =

Jerad, Selim and Svete, Anej and Li, Jiaoda and Cotterell, Ryan , booktitle =. Unique Hard Attention: A Tale of Two Sides , url =. doi:10.18653/v1/2025.acl-short.76 , editor =

work page doi:10.18653/v1/2025.acl-short.76 2025
[44]

Efficient Parallel Samplers for Recurrent-Depth Models and Their Connections to Diffusion Language Models , url =

Jonas Geiping and Xinyu Yang and Guinan Su , booktitle =. Efficient Parallel Samplers for Recurrent-Depth Models and Their Connections to Diffusion Language Models , url =
[49]

A Formal Comparison Between Chain-of-Thought and Latent Thought , url =

Kevin Xu and Issei Sato , journal =. A Formal Comparison Between Chain-of-Thought and Latent Thought , url =
[52]

Deliberation in Latent Space via Differentiable Cache Augmentation , url =

Luyang Liu and Jonas Pfeiffer and Jiaxing Wu and Jun Xie and Arthur Szlam , booktitle =. Deliberation in Latent Space via Differentiable Cache Augmentation , url =
[54]

Dynamic Parameter Reuse Augments Reasoning via Latent Chain of Thought , url =

Maile, Kaitlin and Sacramento, João , booktitle =. Dynamic Parameter Reuse Augments Reasoning via Latent Chain of Thought , url =
[55]

The Illusion of Superposition in Latent CoT via Soft Thinking , url =

Michael Rizvi-Martel and Marius Mosbach , booktitle =. The Illusion of Superposition in Latent CoT via Soft Thinking , url =
[57]

What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages , year =

Nadav Borenstein and Anej Svete and Robin Chan and Josef Valvoda and Franz Nowak and Isabelle Augenstein and Eleanor Chodroff and Ryan Cotterell , month = aug, publisher =. What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages , year =
[58]

s1: Simple test-time scaling

s1: Simple test-time scaling , url =. arXiv , author =:2501.19393 , journal =

work page internal anchor Pith review Pith/arXiv arXiv
[59]

Reddi , booktitle =

Nikunj Saunshi and Nishanth Dikkala and Zhiyuan Li and Sanjiv Kumar and Sashank J. Reddi , booktitle =. Reasoning with Latent Thoughts: On the Power of Looped Transformers , url =
[62]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , title =

R. The Thirty-ninth Annual Conference on Neural Information Processing Systems , title =
[63]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , title =

R. The Thirty-eighth Annual Conference on Neural Information Processing Systems , title =
[65]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , title =

Robin Chan and Reda Boumasmoud and Anej Svete and Yuxin Ren and Qipeng Guo and Zhijing Jin and Shauli Ravfogel and Mrinmaya Sachan and Bernhard Sch. The Thirty-eighth Annual Conference on Neural Information Processing Systems , title =
[69]

arXiv , author =:2311.04329 , primaryclass =

Formal Aspects of Language Modeling , url =. arXiv , author =:2311.04329 , primaryclass =

work page arXiv
[70]

Think before you speak: Training Language Models With Pause Tokens , url =

Sachin Goyal and Ziwei Ji and Ankit Singh Rawat and Aditya Krishna Menon and Sanjiv Kumar and Vaishnavh Nagarajan , booktitle =. Think before you speak: Training Language Models With Pause Tokens , url =
[71]

Clair and Paul Fodor and Chihiro Shibata and Jeffrey Heinz , journal =

Sam van der Poel and Dakotah Lambert and Kalina Kostyszyn and Tiantian Gao and Rahul Verma and Derek Andersen and Joanne Chau and Emily Peterson and Cody St. Clair and Paul Fodor and Chihiro Shibata and Jeffrey Heinz , journal =. MLRegTest: A Benchmark for the Machine Learning of Regular Languages , url =
[72]

Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise Lo

Sangmin Bae and Adam Fisch and Hrayr Harutyunyan and Ziwei Ji and Seungyeon Kim and Tal Schuster , booktitle =. Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise Lo
[73]

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation , url =

Sangmin Bae and Yujin Kim and Reza Bayat and Sungnyun Kim and Jiyoun Ha and Tal Schuster and Adam Fisch and Hrayr Harutyunyan and Ziwei Ji and Aaron Courville and Se-Young Yun , booktitle =. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation , url =
[77]

Training Large Language Models to Reason in a Continuous Latent Space , url =

Shibo Hao and Sainbayar Sukhbaatar and DiJia Su and Xian Li and Zhiting Hu and Jason E Weston and Yuandong Tian , booktitle =. Training Large Language Models to Reason in a Continuous Latent Space , url =
[80]

Targeted Syntactic Evaluation on the

Someya, Taiga and Yoshida, Ryo and Oseki, Yohei , booktitle =. Targeted Syntactic Evaluation on the
[81]

What Formal Languages Can Transformers Express? A Survey , url =

Strobl, Lena and Merrill, William and Weiss, Gail and Chiang, David and Angluin, Dana , doi =. What Formal Languages Can Transformers Express? A Survey , url =. Transactions of the Association for Computational Linguistics , pages =
[83]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Łukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =
[84]

and Le, Quoc V

Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed H. and Le, Quoc V. and Zhou, Denny , booktitle =. Chain-of-thought prompting elicits reasoning in large language models , year =
[85]

Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization , url =

Wengao Ye and Yan Liang and Lianlei Shan , booktitle =. Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization , url =
[86]

Think Silently, Think Fast: Dynamic Latent Compression of

Wenhui Tan and Jiaze Li and Jianzhong Ju and Zhenbo Luo and Ruihua Song and Jian Luan , booktitle =. Think Silently, Think Fast: Dynamic Latent Compression of
[87]

A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers , url =

William Merrill and Ashish Sabharwal , booktitle =. A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers , url =
[88]

Exact Expressive Power of Transformers with Padding , url =

William Merrill and Ashish Sabharwal , booktitle =. Exact Expressive Power of Transformers with Padding , url =
[89]

The Expressive Power of Transformers with Chain of Thought , url =

William Merrill and Ashish Sabharwal , booktitle =. The Expressive Power of Transformers with Chain of Thought , url =
[94]

Guiding Language Model Reasoning with Planning Tokens , url =

Xinyi Wang and Lucas Caccia and Oleksiy Ostapenko and Xingdi Yuan and William Yang Wang and Alessandro Sordoni , booktitle =. Guiding Language Model Reasoning with Planning Tokens , url =
[97]

AlgoFormer: An Efficient Transformer Framework with Algorithmic Structures , url =

Yihang Gao and Chuanyang Zheng and Enze Xie and Han Shi and Tianyang Hu and Yu Li and Michael Ng and Zhenguo Li and Zhaoqiang Liu , issn =. AlgoFormer: An Efficient Transformer Framework with Algorithmic Structures , url =. Transactions on Machine Learning Research , note =
[98]

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy , url =. arXiv , author =:2604.02709 , journal =

work page internal anchor Pith review Pith/arXiv arXiv
[99]

Looped Transformers for Length Generalization , url =

Ying Fan and Yilun Du and Kannan Ramchandran and Kangwook Lee , booktitle =. Looped Transformers for Length Generalization , url =
[101]

SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens , url =

Yinhan He and Wendy Zheng and Yaochen Zhu and Zaiyi Zheng and Lin Su and Sriram Vasudevan and Qi Guo and Liangjie Hong and Jundong Li , booktitle =. SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens , url =
[102]

Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning , url =

Yu, Qifan and He, Zhenyu and Li, Sijie and Xun, Zhou and Zhang, Jun and Xu, Jingjing and He, Di , booktitle =. Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning , url =. doi:10.18653/v1/2026.eacl-long.97 , editor =

work page doi:10.18653/v1/2026.eacl-long.97 2026
[106]

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , url =

Zhiyuan Li and Hong Liu and Denny Zhou and Tengyu Ma , booktitle =. Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , url =
[107]

Language models are unsupervised multitask learners , author=
[109]

Unlocking out-of-distribution generalization in transformers via recursive latent space reasoning

Awni Altabaa, Siyu Chen, John Lafferty, and Zhuoran Yang. Unlocking out-of-distribution generalization in transformers via recursive latent space reasoning. arXiv preprint arXiv:2510.14095, 2025. URL https://arxiv.org/abs/2510.14095

work page arXiv 2025
[110]

Latent reasoning with supervised thinking states

Ido Amos, Avi Caciularu, Mor Geva, Amir Globerson, Jonathan Herzig, Lior Shani, and Idan Szpektor. Latent reasoning with supervised thinking states. arXiv preprint arXiv:2602.08332, 2026. URL https://arxiv.org/abs/2602.08332

work page arXiv 2026
[111]

Relaxed recursive transformers: Effective parameter sharing with layer-wise lo RA

Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed recursive transformers: Effective parameter sharing with layer-wise lo RA . In The Thirteenth International Conference on Learning Representations, 2025 a . URL https://openreview.net/forum?id=WwpYSOkkCt

2025
[112]

Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation

Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, and Se-Young Yun. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025 b . URL https://openreview...

2025
[113]

Thinking deeper, not longer: Depth-recurrent transformers for compositional generalization

Hung-Hsuan Chen. Thinking deeper, not longer: Depth-recurrent transformers for compositional generalization. arXiv preprint arXiv:2603.21676, 2026. URL https://arxiv.org/abs/2603.21676

work page arXiv 2026
[114]

arXiv preprint arXiv:2505.16782 (2025)

Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning. arXiv preprint arXiv:2505.16782, 2025 a . URL https://arxiv.org/abs/2505.16782

work page arXiv 2025
[115]

Inner thinking transformer: Leveraging dynamic depth scaling to foster adaptive internal thinking

Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang. Inner thinking transformer: Leveraging dynamic depth scaling to foster adaptive internal thinking. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting o...

doi:10.18653/v1/2025.acl-long.1369 2025
[116]

Think consistently, reason efficiently: Energy-based calibration for implicit chain-of-thought

Zhikang Chen, Sen Cui, Deheng Ye, Yu Zhang, Yatao Bian, and Tingting Zhu. Think consistently, reason efficiently: Energy-based calibration for implicit chain-of-thought. arXiv preprint arXiv:2511.07124, 2025c. URL https://arxiv.org/abs/2511.07124

2025
[117]

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171, 2024. URL https://arxiv.org/abs/2412.13171

2024
[118]

Spot: Span-level pause-of-thought for efficient and interpretable latent reasoning in large language models

Yunlong Chu, Minglai Shao, Yuhang Liu, Bing Hao, Yumeng Lin, Jialu Wang, and Ruijie Wang. Spot: Span-level pause-of-thought for efficient and interpretable latent reasoning in large language models. arXiv preprint arXiv:2603.06222, 2026. URL https://arxiv.org/abs/2603.06222

2026
[119]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

2021
[120]

MoEUT: Mixture-of-experts universal transformers

Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, and Christopher D Manning. MoEUT: Mixture-of-experts universal transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=ZxVrkm7Bjl

2024
[121]

Do language models use their depth efficiently?

Róbert Csordás, Christopher D Manning, and Christopher Potts. Do language models use their depth efficiently? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=Kz6eUL86XP

2025
[122]

How do latent reasoning methods perform under weak and strong supervision?

Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu, Rui Sun, Zhiji Liu, Yue Xing, Jiliang Tang, and Benoit Dumoulin. How do latent reasoning methods perform under weak and strong supervision? arXiv preprint arXiv:2602.22441, 2026. URL https://arxiv.org/abs/2602.22441

2026
[123]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

DeepSeek-AI. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633--638, Sep 2025. ISSN 1476-4687. doi:10.1038/s41586-025-09422-z. URL https://doi.org/10.1038/s41586-025-09422-z

doi:10.1038/s41586-025-09422-z 2025
[124]

Universal Transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2019. URL https://arxiv.org/abs/1807.03819

2019
[125]

LLM latent reasoning as chain of superposition

Jingcheng Deng, Liang Pang, Zihao Wei, Shicheng Xu, Zenghao Duan, Kun Xu, Yang Song, Huawei Shen, and Xueqi Cheng. LLM latent reasoning as chain of superposition. arXiv preprint arXiv:2510.15522, 2026. URL https://arxiv.org/abs/2510.15522

2026
[126]

Implicit chain of thought reasoning via knowledge distillation

Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460, 2023

2023
[127]

Looped transformers for length generalization

Ying Fan, Yilun Du, Kannan Ramchandran, and Kangwook Lee. Looped transformers for length generalization. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=2edigk8yoU

2025
[128]

Efficient reasoning models: A survey

Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey. arXiv preprint arXiv:2504.10903, 2025. URL https://arxiv.org/abs/2504.10903

2025
[129]

SeLaR: Selective Latent Reasoning in Large Language Models

Renyu Fu and Guibo Luo. SeLaR: Selective latent reasoning in large language models. arXiv preprint arXiv:2604.08299, 2026. URL https://arxiv.org/abs/2604.08299

2026
[130]

Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, and Yu Wang. Think-at-hard: Selective latent iterations to improve reasoning language models. arXiv preprint arXiv:2511.08577, 2025. URL https://arxiv.org/abs/2511.08577

2025
[131]

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025a. URL https://arxiv.org/abs/2502.05171

2025

