Recursive Scaling in Masked Diffusion Models

Alba Carballo-Castro; Julianna Piskorz; Mihaela van der Schaar; Pascal Frossard; Paulius Rauba

arxiv: 2606.18022 · v1 · pith:AMVDFQGNnew · submitted 2026-06-16 · 💻 cs.LG

Pith reviewed 2026-06-27 01:41 UTC · model grok-4.3

keywords: masked diffusion models · recursive scaling · parameter efficiency · sequence generation · Sudoku · Countdown · denoising steps · iterative refinement

The pith

Reusing the same denoising transformer for L recursive iterations within each diffusion step often matches the performance of non-recursive masked diffusion models with roughly L times more parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that recursive depth offers a third axis for scaling masked diffusion models beyond adding parameters or denoising steps. By applying the identical transformer repeatedly inside each diffusion step, the approach reuses parameters to refine outputs iteratively and increase effective depth. On structured sequence tasks such as Sudoku and Countdown, models using L recursive iterations reach accuracy levels comparable to non-recursive baselines that contain roughly L times as many parameters. The same recursion also reduces the number of denoising steps needed at inference while preserving output quality. These findings position recursive application as a practical mechanism for improving both parameter efficiency and test-time compute allocation in MDMs.

Core claim

Recursive Masked Diffusion Models (R-MDMs) introduce recursive depth by repeatedly applying the same denoising transformer within each diffusion step. This produces iterative refinement of the generated sequence through parameter reuse, increasing effective model depth without increasing the parameter count. Across Sudoku and Countdown tasks, an R-MDM with L recursive iterations matches the performance of non-recursive baselines that use roughly L times more parameters. Recursive refinement can also substitute for additional denoising steps, allowing the same generation quality to be reached with fewer forward passes during inference.

What carries the argument

Recursive depth, implemented as repeated application of the identical denoising transformer inside each diffusion step to produce iterative refinement via parameter reuse.
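The mechanism can be sketched as a loop nested inside the usual denoising loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `ToyDenoiser` is a hypothetical stand-in for the shared transformer, feeding the previous loop's logits back in is an assumed form of refinement, and the confidence-based unmasking schedule is likewise illustrative.

```python
import math

MASK = None  # hypothetical placeholder for a masked position


class ToyDenoiser:
    """Stand-in for the shared denoising transformer (not the paper's model)."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.calls = 0  # counts forward passes through the single parameter set

    def __call__(self, tokens, prev_logits=None):
        self.calls += 1
        logits = []
        for pos in range(len(tokens)):
            row = [math.sin(pos + v) for v in range(self.vocab_size)]
            if prev_logits is not None:
                # Assumed refinement rule: build on the previous loop's prediction.
                row = [r + p for r, p in zip(row, prev_logits[pos])]
            logits.append(row)
        return logits


def recursive_step(denoiser, tokens, num_loops):
    """Apply the same denoiser num_loops times inside one diffusion step."""
    logits = None
    for _ in range(num_loops):
        logits = denoiser(tokens, prev_logits=logits)
    return logits


def sample(denoiser, seq_len, num_steps, num_loops):
    """Run num_steps denoising steps, each containing num_loops recursive passes."""
    tokens = [MASK] * seq_len
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(tokens) if tok is MASK]
        if not masked:
            break
        logits = recursive_step(denoiser, tokens, num_loops)
        # Unmask an equal share of the remaining positions, most confident first.
        quota = math.ceil(len(masked) / (num_steps - step))
        masked.sort(key=lambda i: max(logits[i]), reverse=True)
        for i in masked[:quota]:
            tokens[i] = max(range(denoiser.vocab_size), key=lambda v: logits[i][v])
    return tokens


denoiser = ToyDenoiser(vocab_size=4)
result = sample(denoiser, seq_len=9, num_steps=3, num_loops=2)
```

In this toy setup the parameter set stays fixed while the number of forward passes grows as steps times loops, which is the sense in which recursion adds depth without adding parameters.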

If this is right

An R-MDM with L recursive iterations reaches performance comparable to a non-recursive model with roughly L times more parameters on Sudoku and Countdown.
Recursive refinement allows the same generation quality to be obtained with fewer denoising steps at inference time.
Recursive depth increases effective model depth without any increase in parameter count.
Recursive scaling supplies a third axis alongside parameter count and denoising steps for improving MDM performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same recursive mechanism may allow training larger effective models on hardware with limited memory by avoiding simultaneous storage of L distinct parameter sets.
Adaptive choice of recursion depth per input could further optimize the trade-off between quality and inference cost.
The observed substitution between recursion and denoising steps suggests that total test-time compute can be reallocated between depth and step count.
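The reallocation idea in the last point can be made concrete with a small enumeration. This is a sketch under a simplifying assumption that is not taken from the paper: each recursive loop costs one forward pass, so a sample with T denoising steps and L loops per step costs T × L passes. The function name `allocations` is hypothetical.

```python
def allocations(pass_budget):
    """List every (T, L) split whose product equals the forward-pass budget.

    T is the number of denoising steps and L the recursion depth per step.
    Assumes cost = T * L forward passes, which ignores any per-step overhead.
    """
    return [
        (steps, pass_budget // steps)
        for steps in range(1, pass_budget + 1)
        if pass_budget % steps == 0
    ]


# All ways to spend 12 forward passes, from one deep step to twelve shallow ones.
splits = allocations(12)
```

Which of these splits gives the best quality is an empirical question; the enumeration only shows the space of options at equal cost.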

Load-bearing premise

That repeated application of the identical denoising transformer within a diffusion step produces stable iterative refinement without introducing compounding errors or mode collapse on the target tasks.

What would settle it

If a recursive model with L=4 iterations on the Sudoku task performs worse than a non-recursive baseline with exactly 4 times as many parameters, the central efficiency claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.18022 by Alba Carballo-Castro, Julianna Piskorz, Mihaela van der Schaar, Pascal Frossard, Paulius Rauba.

**Figure 1.** Standard capability gains often come from scaling model size, whereas our approach improves generation by looping an MDLM over its own predictions. The model predicts all tokens in parallel and then iteratively refines the sequence, correcting early mistakes and enabling higher-quality samples with fewer decoding steps.

**Figure 2.** Comparison between the standard one-pass MDLM and the recursive variant. The recursive model reuses the same transformer blocks across refinement steps, optionally conditioned on a step embedding. At inference time, there is a recursive loop within each denoising step to refine predictions.

**Figure 3.** Effect of recursive refinement depth (L) on task performance for Sudoku 9 × 9 and Countdown across different target lengths (3, 4, and 5 digits). Increasing L consistently improves success rates, with the strongest gains for more difficult problems.

**Figure 4.** Effect of recursive refinement depth (fixed L at training and sampling) on 25 × 25 Sudoku reconstruction under different masking conditions. Across all masking regimes, recursion improves valid puzzle recovery. For Sudoku 25 × 25, we observe that the single-pass baseline fails almost entirely at 80% and 90% masking (6.6% and 0% valid at T=5), while the model trained and evaluated with fixed L=3 achieves …

**Figure 5.** Effect of recursion at matched effective model depth (iso-FLOP) for Sudoku 9 × 9 and Countdown tasks of different target length (3, 4, and 5 digits). The benefit of looping is clearest on Countdown-4 (medium). At T=20, (3⊗5) and (3⊗10) reach 70.2% and 74.4% RTR, beating the iso-FLOP 15-layer model (69.4%) and matching the 30-layer model (69.6%) with 5×–10× fewer parameters. Finally, on Countdown-5 (hard), …
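The arithmetic behind the Figure 5 comparison can be checked with a short sketch. It reads the label (K⊗L) as K distinct layers looped L times, which is an assumption because the notation is not defined in this excerpt. It also assumes that parameters scale with the number of distinct layers and ignores embeddings and output heads. The function names are hypothetical.

```python
def effective_depth(distinct_layers, loops):
    """Layers traversed per forward pass when the same stack is looped."""
    return distinct_layers * loops


def parameter_ratio(baseline_layers, distinct_layers):
    """How many times more layer parameters the non-recursive baseline holds.

    Assumes identical layer width, so parameters are proportional to the
    number of distinct layers; embedding and head parameters are ignored.
    """
    return baseline_layers / distinct_layers


# (3⊗5) against the 15-layer baseline, and (3⊗10) against the 30-layer one.
depth_small = effective_depth(3, 5)
depth_large = effective_depth(3, 10)
ratio_small = parameter_ratio(15, 3)
ratio_large = parameter_ratio(30, 3)
```

Under these assumptions the two recursive models match the baselines' depths of 15 and 30 while holding 5 and 10 times fewer layer parameters, consistent with the "5×–10× fewer parameters" in the caption.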

**Figure 6.** Cross-recursion trade-off between training recursion depth ($L_t$) and sampling recursion depth ($L_s$) for 9 × 9 Sudoku tasks. The top heatmap shows one-step decoding, while the bottom shows 5 decoding steps.

**Figure 8.** Parameter-step Pareto frontier between the model's parameter count and the number of decoding steps T needed to reach 95% VPR in Sudoku 9×9 (top) and 70% RTR in Countdown 4 (bottom). Recursive models (trained and evaluated with fixed steps) outperform non-recursive variants with larger parameter count. Results: we observe that recursive models consistently reach the target performance at a much smaller par…

**Figure 9.** Performance comparison between models trained with different recursive steps choices on 9 × 9 Sudoku puzzles. We report the Valid Puzzle Rate across varying loop counts at sampling $L_s$ and denoising steps T. D.2. Training objectives: here we extend the discussion on the different supervision methods, and report the results from the ablation study. Final-step loss (FINAL): only the logits produced at the last…

**Figure 10.** Schematic of the different supervision modes considered. … where $w_\ell \ge 0$ and $\sum_{\ell=1}^{L} w_\ell = 1$. In our implementation, we consider two weighting schemes. For linear weighting, $w_\ell = \ell^{\alpha} / \sum_{j=1}^{L} j^{\alpha}$ (18), while for exponential weighting, $w_\ell = \exp(\alpha(\ell - L)) / \sum_{j=1}^{L} \exp(\alpha(j - L))$ (19), where $\alpha > 0$ controls how strongly supervision is concentrated toward later loops. Larger values of $\alpha$ place increasing emphasis on …
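The two weighting schemes in the Figure 10 text (equations 18 and 19) can be sketched directly: the linear weight of loop ℓ is ℓ raised to α, normalised over all L loops, and the exponential weight is exp(α(ℓ − L)), normalised the same way. The sketch below is illustrative only. The function names are hypothetical, and combining the per-loop losses as a weighted sum is an assumption, since the loss equation itself is cut off in this excerpt.

```python
import math


def linear_weights(num_loops, alpha):
    """Weights proportional to loop_index ** alpha, normalised to sum to 1."""
    raw = [loop ** alpha for loop in range(1, num_loops + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def exponential_weights(num_loops, alpha):
    """Weights proportional to exp(alpha * (loop_index - num_loops))."""
    raw = [math.exp(alpha * (loop - num_loops)) for loop in range(1, num_loops + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_loss(per_loop_losses, weights):
    """Assumed combination rule: weighted sum of the loss at each loop."""
    return sum(w * loss for w, loss in zip(weights, per_loop_losses))


# Three loops with alpha = 1: later loops receive more supervision weight.
lin = linear_weights(3, 1.0)
exp_w = exponential_weights(3, 1.0)
total_loss = weighted_loss([3.0, 2.0, 1.0], lin)
```

Both schemes put the largest weight on the final loop for any α > 0, and a larger α concentrates more of the supervision there.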

**Figure 11.** Performance comparison between supervision losses on 9 × 9 Sudoku puzzles for the model trained with $L_t$ = 5 recursive steps. We report the Valid Puzzle Rate across varying loop counts at sampling $L_s$ and denoising steps T. … FINAL is notable. This supports the view that dense, per-step supervision is necessary to align every recursive layer with the denoising objective, rather than relying on error signals t…

**Figure 12.** Figure 12 shows a relatively close comparison between learned embeddings and the no-embedding baseline, with both outperforming fixed embeddings in the most important low-depth regimes. Learned embeddings perform strongly across the sweep and give the model an explicit, trainable signal for the current recursion step, which makes them a natural general-purpose choice. However, the no-embedding baseline is also very…

Original abstract

Masked diffusion models (MDMs) have recently emerged as a promising paradigm for sequence generation. Scaling MDMs is conventionally achieved by increasing the parameter count or the number of denoising steps. We introduce Recursive Masked Diffusion Models (R-MDMs), which add recursive depth as a third scaling axis by repeatedly applying the same denoising transformer within each diffusion step. Recursion enables iterative refinement of the output through parameter reuse, increasing effective model depth without increasing parameter count. Across structured generation tasks, including Sudoku and Countdown, we show that R-MDMs achieve substantially improved parameter efficiency: a model with $L$ recursive iterations often matches the performance of non-recursive baselines with roughly $L\times$ more parameters. Moreover, recursive refinement can partially substitute for additional denoising steps, allowing recursive models to reach the same generation quality with fewer forward passes at inference time. These results suggest that recursive depth is a practically useful scaling mechanism for MDMs, improving both parameter efficiency and the allocation of test-time compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Recursive depth via parameter reuse is a workable third scaling axis for masked diffusion models on structured tasks like Sudoku.

The main point is that this paper adds recursion inside each diffusion step by reusing the same denoising transformer multiple times. This creates effective depth without new parameters and shows up as better efficiency on the tested tasks.

What stands out as new is treating recursive iterations as an explicit scaling dimension alongside width, depth, or step count. The work demonstrates that L recursive passes often match the performance of non-recursive models with roughly L times more parameters. It also shows recursion can substitute for some denoising steps, cutting inference passes while keeping quality.

The paper does a solid job laying out the empirical case on Sudoku and Countdown. The results line up with the claim of improved parameter efficiency and better test-time compute use. The framing is direct and the mechanism is simple to understand.

Soft spots are limited. The tasks are narrow puzzle domains, so it is unclear how far the gains extend to text, images, or less structured data. The abstract gives no implementation specifics or ablations on iteration count or stability, which leaves some room for questions about whether repeated application stays robust. That said, the reported consistency across tasks suggests the experiments already checked for obvious divergence or collapse.

This is for researchers working on efficient scaling of masked diffusion models for structured generation. A reader focused on parameter-efficient or inference-aware generative modeling would get concrete value from the efficiency numbers. It deserves peer review because it introduces a distinct scaling idea with supporting results on the tasks, even if wider domains would strengthen the case.

Referee Report

2 major / 2 minor

Summary. The paper introduces Recursive Masked Diffusion Models (R-MDMs) that scale masked diffusion models along a third axis by repeatedly applying the identical denoising transformer within each diffusion step. The central empirical claim is that L recursive iterations on a fixed model often match the performance of non-recursive baselines with roughly L× more parameters on structured tasks such as Sudoku and Countdown, while recursive refinement can also trade off against the number of denoising steps to reach equivalent quality with fewer forward passes.

Significance. If the reported efficiency gains prove robust, the work identifies recursive depth as a practical scaling mechanism for MDMs that improves parameter efficiency and test-time compute allocation through parameter reuse. This is a straightforward empirical contribution with potential applicability to other structured generation settings, though its significance would increase with explicit controls for total compute and validation beyond the two named tasks. No machine-checked proofs, reproducible code releases, or parameter-free derivations are described.

major comments (2)

[Experiments] The central claim rests on the stability of repeated identical transformer application without compounding errors or mode collapse. The experiments section must include ablations that track output divergence or quality degradation as a function of recursion depth L, together with statistical significance across multiple runs, to substantiate that the observed matching to L×-parameter baselines is not an artifact of task-specific stability.
[Experiments] Table or figure reporting the main Sudoku/Countdown results: the claim of 'roughly L× more parameters' requires explicit reporting of total parameter counts, FLOPs per forward pass, and whether the non-recursive baselines were trained with equivalent total compute; without these, the parameter-efficiency conclusion cannot be isolated from possible confounds in training budget.

minor comments (2)

[Introduction] The abstract states results hold 'across structured generation tasks' yet names only two; the introduction or related-work section should clarify the precise scope of tasks evaluated and any negative results on additional domains.
Notation for the recursion operator and its integration inside a diffusion step should be defined once with a clear equation or pseudocode block to avoid ambiguity when comparing recursive and non-recursive forward passes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments. We address each major comment below and plan to revise the manuscript accordingly to strengthen the experimental section.

Referee: [Experiments] The central claim rests on the stability of repeated identical transformer application without compounding errors or mode collapse. The experiments section must include ablations that track output divergence or quality degradation as a function of recursion depth L, together with statistical significance across multiple runs, to substantiate that the observed matching to L×-parameter baselines is not an artifact of task-specific stability.

Authors: We agree that verifying stability under recursion is important. In the revised version, we will add ablations that plot quality metrics versus recursion depth L, include measures of output divergence, and report means and standard deviations over at least 5 independent runs with different seeds to demonstrate statistical reliability.

Revision: yes

Referee: [Experiments] Table or figure reporting the main Sudoku/Countdown results: the claim of 'roughly L× more parameters' requires explicit reporting of total parameter counts, FLOPs per forward pass, and whether the non-recursive baselines were trained with equivalent total compute; without these, the parameter-efficiency conclusion cannot be isolated from possible confounds in training budget.

Authors: We will revise the manuscript to include a dedicated table or section explicitly listing the total parameter counts for recursive and non-recursive models, approximate FLOPs per forward pass, and details on the training compute budget used for all baselines to ensure fair comparison.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical demonstration of recursive masked diffusion models on structured generation tasks. No equations, derivations, or first-principles claims are present in the provided text or abstract. The central results consist of experimental comparisons showing that L recursive iterations match non-recursive baselines with ~L× parameters; these are direct observations from training and evaluation runs rather than quantities that reduce to fitted inputs or self-citations by construction. No load-bearing self-citation chains, ansatzes, or uniqueness theorems are invoked.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no information on free parameters, axioms, or invented entities; ledger is therefore empty.

[1] [1]

Johnson and Jonathan Ho and Daniel Tarlow and Rianne van den Berg , title =

Jacob Austin and Daniel D. Johnson and Jonathan Ho and Daniel Tarlow and Rianne van den Berg , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[2] [2]

Chiu and Alexander Rush and Volodymyr Kuleshov , title =

Subham Sekhar Sahoo and Marianne Arriola and Yair Schiff and Aaron Gokaslan and Edgar Marroquin and Justin T. Chiu and Alexander Rush and Volodymyr Kuleshov , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[3] [3]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Shen Nie and Fengqi Zhu and Zebin You and Xiaolu Zhang and Jingyang Ou and Jun Hu and Jun Zhou and Yankai Lin and Ji-Rong Wen and Chongxuan Li , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[4] [4]

2025 , journal =

Dream 7B: Diffusion Large Language Models , author=. 2025 , journal =

2025

[5] [5]

2026 , journal =

Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs , author=. 2026 , journal =

2026

[6] [6]

Promises, Outlooks and Challenges of

Justin Deschenaux and Caglar Gulcehre , year=. Promises, Outlooks and Challenges of. arXiv preprint arXiv:2406.11473 , eprint=

arXiv

[7] [7]

International Conference on Learning Representations (ICLR) , year=

On the Reasoning Abilities of Masked Diffusion Language Models , author=. International Conference on Learning Representations (ICLR) , year=

[8] [8]

Universal Transformers , journal =

Mostafa Dehghani and Stephan Gouws and Oriol Vinyals and Jakob Uszkoreit and. Universal Transformers , journal =. 2019 , url =

2019

[9] [9]

International Conference on Learning Representations (ICLR) , year =

Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut , title =. International Conference on Learning Representations (ICLR) , year =

[10] [10]

Lee and Dimitris Papailiopoulos , title =

Angeliki Giannou and Shashank Rajput and Jy-yong Sohn and Kangwook Lee and Jason D. Lee and Dimitris Papailiopoulos , title =. International Conference on Machine Learning (ICML) , year =

[11] [11]

Reddi , title =

Nikunj Saunshi and Nishanth Dikkala and Zhiyuan Li and Sanjiv Kumar and Sashank J. Reddi , title =. International Conference on Learning Representations (ICLR) , year =

[12] Qifan Yu, Zhenyu He, Sijie Li, Xun Zhou, Jun Zhang, Jingjing Xu, and Di He. arXiv preprint arXiv:2502.08482.

[13] Paulius Rauba, Claudio Fanconi, and Mihaela van der Schaar. arXiv preprint arXiv:2603.08082.

[14] Less is More: Recursive Reasoning with Tiny Networks. 2025.

[15] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. arXiv preprint arXiv:2506.21734.

[16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Advances in Neural Information Processing Systems (NeurIPS).

[17] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. International Conference on Learning Representations (ICLR).

[18] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep …

[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al. Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).

[20] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems (NeurIPS).

[21] Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Advances in Neural Information Processing Systems (NeurIPS).

[22] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. arXiv preprint arXiv:2001.08361.

[23] Rasmus Berg Palm, Ulrich Paquet, and Ole Winther. Advances in Neural Information Processing Systems (NeurIPS).

[24] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Advances in Neural Information Processing Systems (NeurIPS).

[25] Deeply-Supervised Nets. 2014.

[26] Alex Graves. arXiv preprint arXiv:1603.08983.

[27] Biao Zhang and Rico Sennrich. Advances in Neural Information Processing Systems (NeurIPS).

[28] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. European Conference on Computer Vision (ECCV).

[29] Ofir Press and Lior Wolf. arXiv preprint arXiv:1608.05859.

[30] Causal Language Modeling Can Elicit Search and Reasoning Capabilities on Logic Puzzles. arXiv preprint arXiv:2409.10502, 2024.

[31] Aaron Lou, Chenlin Meng, and Stefano Ermon. International Conference on Machine Learning (ICML).

[32] Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. International Conference on Machine Learning (ICML).

[33] Julianna Piskorz, Cristina Pinneri, Alvaro Correia, Motasem Alfarra, Risheek Garrepalli, and Christos Louizos. arXiv preprint arXiv:2511.21338.

[34] Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, and Ante Jukic. arXiv preprint arXiv:2602.15014.

[35] Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin T. Chiu, and Volodymyr Kuleshov. International Conference on Machine Learning (ICML).

[36] Chen-Hao Chao, Wei-Fang Sun, Hanwen Liang, Chun-Yi Lee, and Rahul G. Krishnan. Advances in Neural Information Processing Systems (NeurIPS).

[37] David Fox, Sam Bowyer, Song Liu, Laurence Aitchison, Raul Santos-Rodriguez, and Mengyue Yang. arXiv preprint arXiv:2602.23968.

[38] Nesta Midavaine, Christian A. Naesseth, and Grigory Bartosh. EurIPS 2025 Workshop on Principles of Generative Modeling.

[39] Huangjie Zheng, Shansan Gong, Ruixiang Zhang, Tianrong Chen, Jiatao Gu, Mingyuan Zhou, Navdeep Jaitly, and Yizhe Zhang. International Conference on Learning Representations (ICLR).

[40] Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, and Jinwoo Kim. arXiv preprint arXiv:2602.16813.

[41] Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, and Dinghuai Zhang. arXiv preprint arXiv:2510.03206.

[42] Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. International Conference on Learning Representations (ICLR).

[43] Andre He, Sean Welleck, and Daniel Fried. arXiv preprint arXiv:2602.03769.

[44] Kevin Xu and Issei Sato. arXiv preprint arXiv:2509.25239.

[45] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. arXiv preprint arXiv:2502.05171.

[46] Parcae: Scaling Laws For Stable Looped Language Models. arXiv preprint arXiv:2604.12946, 2026.

[47] Ying Fan, Yilun Du, Kannan Ramchandran, and Kangwook Lee. arXiv preprint arXiv:2409.15647.

[48] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. arXiv preprint arXiv:2311.12424.

[49] RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023.

[50] Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D. Goodman. Stream of Search (…). arXiv preprint arXiv:2404.03683.

[51] Tree of Thoughts: Deliberate Problem Solving with Large Language Models. 2023.

[52] Ben Wang and Aran Komatsuzaki. 2021.

[53] Matt Mahoney. 2006.

[54] RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023.

[55] Categorical Flow Maps. 2026.

[56] Chengting Yu, Xiaobo Shu, Yadao Wang, Yizhen Zhang, Haoyi Wu, You Wu, Rujiao Long, Ziheng Chen, Yuchi Xu, Wenbo Su, and Bo Zheng.

[57] Ahmadreza Jeddi, Marco Ciccone, and Babak Taati.

[58] Xingjian Bai and Luke Melas-Kyriazi. Fixed …

[59] Zhengyang Geng, Ashwini Pokle, and J. Zico Kolter. One-…

[60] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed …

[61] Learning Anytime Predictions in Neural Networks via Adaptive Loss Balancing. 2018.

[62] Looping Back to Move Forward: Recursive Transformers for Efficient and Flexible Large Multimodal Models. 2026.