Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

Andrew McCallum; Avishek Joey Bose; Benjamin Rozonoyer; Dhruvesh Patel; Jacopo Minniti; Neil Band; Tim G. J. Rudner

arxiv: 2605.22967 · v1 · pith:AJFMFUXMnew · submitted 2026-05-21 · 💻 cs.LG

Benjamin Rozonoyer, Jacopo Minniti, Dhruvesh Patel, Neil Band, Avishek Joey Bose, Tim G. J. Rudner, Andrew McCallum

Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3

classification 💻 cs.LG

keywords diffusion language models · masked diffusion · relay representations · truncated BPTT · inference optimization · coding tasks · discrete diffusion

The pith

Masked diffusion models can propagate latent information across denoising steps using a learned per-token relay channel.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Learned Relay Representations to prevent masked diffusion models from discarding internal computations between refinement steps. Instead of resetting at every step, the model learns a differentiable channel that passes information forward, trained with truncated backpropagation through time. The design is first justified on a Sudoku planning task and then scaled to Fast-dLLM v2, where it outperforms supervised fine-tuning on coding tasks and reduces inference latency by up to 32 percent. The approach integrates with existing techniques like block diffusion and KV caching.

Core claim

By introducing a differentiable per-token channel trained via truncated BPTT, diffusion language models can explicitly learn to relay latent information forward across decoding steps, advancing the performance-latency Pareto frontier when applied to state-of-the-art models like Fast-dLLM v2.

What carries the argument

Learned Relay Representations: a differentiable per-token channel that passes information between forward passes, trained via truncated backpropagation through time.
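
As a rough illustration of the data flow, the sketch below follows the Figure 1 caption: the input at each position is the token embedding plus the relay-projected hidden state from the previous step. It is a toy in plain Python, not the authors' code. The names `embed`, `relay`, `backbone`, and `step` are hypothetical stand-ins, the relay is a fixed scaling rather than a learned module, and the unembedding into logits is omitted.

```python
import math

# Toy stand-ins only; none of these names or sizes come from the paper.
MASK = 0
DIM = 3

def embed(token):
    # Deterministic toy embedding: one fixed vector per token id.
    return [math.sin(token + 1 + d) for d in range(DIM)]

def relay(hidden, scale=0.5):
    # Stand-in for the relay module: a fixed linear scaling (assumption).
    return [scale * v for v in hidden]

def backbone(inputs):
    # Stand-in for the backbone: mixes each position with the sequence mean.
    n = len(inputs)
    mean = [sum(vec[d] for vec in inputs) / n for d in range(DIM)]
    return [[math.tanh(vec[d] + mean[d]) for d in range(DIM)] for vec in inputs]

def step(tokens, hidden):
    # Per-token input = embedding of the current token + projected relay state.
    inputs = [
        [e + r for e, r in zip(embed(t), relay(h))]
        for t, h in zip(tokens, hidden)
    ]
    return backbone(inputs)

tokens = [2, MASK, MASK, 1]
zero_state = [[0.0] * DIM for _ in tokens]  # empty relay state before step 1

h1 = step(tokens, zero_state)        # first denoising step
h2_relay = step(tokens, h1)          # second step receives h1 through the relay
h2_reset = step(tokens, zero_state)  # second step with a hard reset instead
```

With the relay path, the second step starts from the first step's hidden state, so `h2_relay` differs from `h1`. With a hard reset and unchanged tokens, the toy model simply recomputes `h1`.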

If this is right

The framework scales to state-of-the-art Diffusion Language Models.
Relay is compatible with block diffusion and KV caching.
It outperforms standard supervised finetuning on coding tasks.
Inference latency is reduced by up to 32 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The relay channel could potentially extend to other iterative generation methods that recompute states at each step.
Sudoku-based training of the relay might act as a proxy task for improving structured reasoning in language models.
Optimizing the channel length or structure could yield further latency gains on longer sequences.

Load-bearing premise

That the relay channel learned via truncated BPTT on a Sudoku task will transfer effectively to language modeling without introducing instability or requiring extensive additional hyperparameter search.

What would settle it

If applying Relay to Fast-dLLM v2 yields no performance gain over standard supervised finetuning or fails to reduce inference latency, the central scalability claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.22967 by Andrew McCallum, Avishek Joey Bose, Benjamin Rozonoyer, Dhruvesh Patel, Jacopo Minniti, Neil Band, Tim G. J. Rudner.

**Figure 1.** Schematic of Relay over two consecutive inference steps. At each step k, the backbone f_θ consumes the sum of embedded tokens Emb_θ(x_{t_k}) and the projected relay state R_θ(h_k), producing a hidden state h_{k+1} that is both unembedded into logits for the cross-entropy loss and forwarded through the relay module R_θ (orange path) into the next step. Tokens are progressively unmasked between steps (e.g. [M]→f at s…

**Figure 2.** Accuracy-NFE frontier on Sudoku-Extreme validation. Each curve traces a single training method as we sweep the inference confidence threshold τ ∈ {0.05, 0.10, 0.15, 0.20, 0.25}. A lower τ commits fewer cells per forward pass and so spends more NFEs (rightward), and vice-versa. Shaded ribbons denote ±1 sample standard deviation across three training seeds.

**Figure 3.** GPU memory during one training micro-step of Fast-dLLM v2 on an A100 80GB. Solid lines show the live GPU memory at every decoder-layer forward/backward hook. Dashed lines show the running maximum of live memory within the same micro-step (high-water mark). Phase labels (fwd, fwd2, bwd) mark each phase’s plateau. Relay carries higher live memory through fwd2, but its peak (≈ 20.1GiB) lands within ≈ 1GiB of …
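
The threshold sweep in the Figure 2 caption can be made concrete with a small sketch of confidence-thresholded unmasking. Everything here is an assumption for illustration: the commit rule (commit a cell when 1 minus its confidence is at most τ) is chosen only because it matches the caption's statement that a lower τ commits fewer cells, the fallback of committing the single most confident cell is added so the loop always terminates, and `toy_confidences` is a hypothetical stand-in for a model forward pass.

```python
def toy_confidences(committed, n):
    # Hypothetical stand-in for one forward pass: the confidence of every
    # still-masked cell grows as more cells have already been committed.
    base = [0.70 + 0.03 * (i % 5) for i in range(n)]
    boost = 0.04 * len(committed)
    return {i: min(0.99, base[i] + boost) for i in range(n) if i not in committed}

def decode(n, tau):
    # Returns the number of function evaluations (NFEs) needed to fill n cells.
    committed = set()
    nfe = 0
    while len(committed) < n:
        conf = toy_confidences(committed, n)
        nfe += 1  # one forward pass counts as one NFE
        # Assumed rule: commit cells whose uncertainty is at most tau.
        ready = [i for i, c in conf.items() if 1.0 - c <= tau]
        if not ready:
            # Fallback so decoding progresses: commit the most confident cell.
            ready = [max(conf, key=conf.get)]
        committed.update(ready)
    return nfe

# Sweep the same thresholds as the Figure 2 caption on a 10-cell toy problem.
nfes = {tau: decode(10, tau) for tau in (0.05, 0.10, 0.15, 0.20, 0.25)}
```

In this toy setting the stricter thresholds need more forward passes than the looser ones, which is the rightward movement along the NFE axis that the caption describes. The numbers themselves say nothing about the paper's results.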

Original abstract

When Masked Diffusion Models (MDMs) generate sequences through iterative refinement, the rich internal computation over masked positions is discarded, forcing every subsequent refinement step to recompute the valuable internal information stored as model representations. To avoid a hard reset between denoising rounds, we propose Learned Relay Representations (Relay), a method that allows MDMs to be forward-thinking when denoising by explicitly learning how to propagate latent information for the benefit of future denoising steps. Relay introduces a differentiable per-token channel that passes information between forward passes and is trained via truncated backpropagation through time (BPTT). We show that this framework can be scaled to state-of-the-art Diffusion Language Models (DLMs), and is seamlessly compatible with techniques like block diffusion and KV caching. We first provide a thorough justification of the design choices in Relay on a challenging Sudoku-based planning task. We then scale Relay to Fast-dLLM v2, a state-of-the-art DLM, outperforming standard supervised finetuning on coding tasks while reducing inference latency by up to 32%. Our empirical results demonstrate that state-of-the-art DLMs can be explicitly trained to relay latent information forward across decoding steps, advancing the performance-latency Pareto frontier. We provide code for all our experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Relay adds a trainable per-token channel with truncated BPTT to carry state across denoising steps in MDMs, with claims of scaling to Fast-dLLM v2 and 32% latency cuts on coding, but the abstract leaves controls and transfer details thin.

The main point is that the authors introduce Learned Relay Representations: a differentiable per-token channel trained with truncated BPTT so that masked diffusion models can pass useful internal state forward instead of recomputing from scratch at every denoising round. They first validate the design on a Sudoku planning task, then plug it into Fast-dLLM v2 and report gains over plain supervised finetuning plus up to 32% lower inference latency, while claiming easy compatibility with block diffusion and KV caching. Code is released, which is useful on its own.

Referee Report

3 major / 1 minor

Summary. The paper introduces Learned Relay Representations (Relay) for Masked Diffusion Models (MDMs), which learns a differentiable per-token channel to propagate latent information between denoising steps, trained using truncated backpropagation through time (BPTT). Design choices are justified on a Sudoku-based planning task before scaling to Fast-dLLM v2, where Relay is claimed to outperform standard supervised finetuning on coding tasks, reduce inference latency by up to 32%, and remain compatible with block diffusion and KV caching. The manuscript provides code for all experiments.

Significance. Should the empirical findings prove robust, this work has the potential to advance diffusion language models by enabling explicit forward propagation of useful latent states, improving both performance and efficiency. The release of code is a positive aspect that supports reproducibility and further research in the area.

major comments (3)

[Scaling experiments to Fast-dLLM v2] The transfer of the relay channel learned via truncated BPTT on Sudoku to the larger Fast-dLLM v2 model is central to the claims of outperformance and latency reduction, yet the manuscript does not report any analysis of training stability, convergence issues, or the extent of hyperparameter retuning required, leaving open the possibility that gains are not solely attributable to the relay mechanism.
[Empirical results on coding tasks] The abstract and results claim outperforming SFT and 32% latency reduction, but provide no details on experimental controls, number of runs, statistical significance, or ablation studies, which undermines the ability to assess the reliability of these load-bearing empirical results.
[Compatibility with optimizations] The claim of seamless compatibility with block diffusion and KV caching lacks specific implementation details or ablations showing that the per-token relay channel integrates without performance degradation or additional overhead.

minor comments (1)

[Abstract] The abstract could benefit from a brief mention of the scale of the Sudoku task or key hyperparameters to provide context for the justification step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below with proposed changes to the manuscript.

Referee: [Scaling experiments to Fast-dLLM v2] The transfer of the relay channel learned via truncated BPTT on Sudoku to the larger Fast-dLLM v2 model is central to the claims of outperformance and latency reduction, yet the manuscript does not report any analysis of training stability, convergence issues, or the extent of hyperparameter retuning required, leaving open the possibility that gains are not solely attributable to the relay mechanism.

Authors: The Sudoku task served to rigorously justify the relay design under controlled planning conditions before scaling. We agree that explicit discussion of the transfer would strengthen the claims. In the revision we will add a subsection describing the hyperparameter transfer process, observed convergence on Fast-dLLM v2, and the limited retuning performed, while noting that the code release permits independent verification of stability.

revision: yes

Referee: [Empirical results on coding tasks] The abstract and results claim outperforming SFT and 32% latency reduction, but provide no details on experimental controls, number of runs, statistical significance, or ablation studies, which undermines the ability to assess the reliability of these load-bearing empirical results.

Authors: We acknowledge that the manuscript does not currently report the requested experimental details. The released code contains the full evaluation pipeline. In the revision we will expand the experimental section to specify the number of runs, any observed variance, and additional ablations isolating the relay channel's contribution to both accuracy and latency gains.

revision: yes

Referee: [Compatibility with optimizations] The claim of seamless compatibility with block diffusion and KV caching lacks specific implementation details or ablations showing that the per-token relay channel integrates without performance degradation or additional overhead.

Authors: The per-token relay channel is architecturally orthogonal to block processing and KV caching. We agree that concrete details are needed. The revision will include pseudocode for the integration, explicit statements on state maintenance across blocks, and any measured overhead from our existing experiments.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is empirically trained and evaluated

The paper introduces Learned Relay Representations as a new differentiable per-token channel trained via truncated BPTT. Design choices are justified empirically on a Sudoku planning task, then the module is scaled and evaluated on Fast-dLLM v2 for coding tasks, with reported gains over SFT and latency reductions. No derivation chain reduces a claimed result to its own fitted inputs by construction, no self-citation is load-bearing for a uniqueness claim, and no ansatz or renaming is smuggled in. The central claims rest on external empirical benchmarks rather than tautological reparameterization. This is the expected self-contained case for a trainable architectural addition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of the learned relay channel and its training procedure; no explicit free parameters, axioms, or invented entities are detailed in the abstract beyond the introduction of the relay channel itself.

axioms (1)

domain assumption: Truncated backpropagation through time suffices to train the relay channel without vanishing or exploding gradients across denoising steps.
The abstract states that the training method relies on truncated BPTT.
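
To make the assumption concrete, here is a minimal sketch of what truncation does to a gradient, using a scalar linear recurrence h_k = w · h_{k-1} + x_{k-1} in place of the real model. This is a toy under stated assumptions, not the paper's training procedure: the recurrence, the squared-error loss, the inputs, and the names `rollout`, `loss`, and `grad_w` are all hypothetical. Truncation with window T treats the state T steps back as a constant, so only the T most recent terms of the gradient survive.

```python
def rollout(w, xs, h0=0.0):
    # Unroll the toy recurrence h_k = w * h_{k-1} + x_{k-1}.
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs

def loss(w, xs, ys):
    # Sum of squared errors over all steps (illustrative choice of loss).
    hs = rollout(w, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

def grad_w(w, xs, ys, window=None):
    # Exact gradient: dh_k/dw = sum_{j=0}^{k-1} w**j * h_{k-1-j}.
    # Truncated BPTT keeps only the terms with j < window.
    hs = rollout(w, xs)
    total = 0.0
    for k in range(1, len(hs)):
        depth = k if window is None else min(k, window)
        dh = sum((w ** j) * hs[k - 1 - j] for j in range(depth))
        total += (hs[k] - ys[k - 1]) * dh
    return total

# Hypothetical inputs and targets for the toy problem.
w = 0.6
xs = [1.0, 0.5, -0.3, 0.8, 0.2]
ys = [0.0, 0.0, 0.0, 0.0, 0.0]

full = grad_w(w, xs, ys)
# Absolute gap between the truncated and the full gradient per window size.
errors = [abs(grad_w(w, xs, ys, window=T) - full) for T in range(1, len(xs) + 1)]
```

Each additional step back in time scales its contribution by another factor of `w`, so in this toy the dropped terms shrink when |w| < 1 and would grow when |w| > 1. Whether the analogous quantities stay well behaved in the actual relay channel is exactly what the axiom above takes on trust.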


[1] [1]

Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg

Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In Neural Information Processing Systems, pages 17981--17993, 2021

work page 2021

[2] [2]

Andrew Campbell, Joe Benton, Valentin De Bortoli, Tom Rainforth, George Deligiannidis, and A. Doucet. A continuous time framework for discrete denoising models. In Neural Information Processing Systems, pages 28266--28279. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2022. doi:10.48550/arXiv.2205.14987. URL https://openreview.net/foru...

work page doi:10.48550/arxiv.2205.14987 2022

[3] [3]

Simple and effective masked diffusion language models,

Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models, 2024. URL http://arxiv.org/abs/2406.07524

work page arXiv 2024

[4] [4]

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K. Titsias. Simplified and generalized masked diffusion for discrete data, 2024. URL http://arxiv.org/abs/2406.04329

work page arXiv 2024

[5] [5]

Reddi, Stefanie Jegelka, and Sanjiv Kumar

Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, and Sanjiv Kumar. Can looped transformers learn to implement multi-step gradient descent for in-context learning? In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, International Conference on Machine L...

work page doi:10.48550/arxiv.2410.08292 2024

[6] [6]

Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. In International Conference on Learning Representations, 2025. doi:10.48550/arXiv.2502.17416

work page doi:10.48550/arxiv.2502.17416 2025

[7] [7]

Chain of thought empowers transformers to solve inherently serial problems

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. In International Conference on Learning Representations, 2024. doi:10.48550/arXiv.2402.12875

work page doi:10.48550/arxiv.2402.12875 2024

[8] [8]

Paul J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78 0 (10): 0 1550--1560, 1990. doi:10.1109/5.58337

work page doi:10.1109/5.58337 1990

[9] [9]

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhi-Hong Qi, Jiaqi Han, S. Sahoo, and V. Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models, 2025. URL http://arxiv.org/abs/2503.09573

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. arXiv.org, 2025 a . doi:10.48550/arXiv.2505.22618

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.22618 2025

[11] [11]

Fast- dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328,

Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm, 2025 b . URL https://arxiv.org/abs/2509.26328

work page arXiv 2025

[12] [12]

Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. In International Conference on Learning Representations, 2024. doi:10.48550/arXiv.2409.02908. URL https://openreview.net/forum?id=CTC7CmirNr

work page doi:10.48550/arxiv.2409.02908 2024

[13] [13]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Chao You, Xiaojie Zhang, Jingyang Ou, and Jun Zhu. LLaDA : Large language diffusion with autoregressive initialization, 2025. URL http://arxiv.org/abs/2502.09992

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking , 2025

Heli Ben-Hamu, Itai Gat, Daniel Severo, Niklas Nolte, and Brian Karrer. Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking , 2025

work page 2025

[15] [15]

Train for the worst, plan for the best: Understand- ing token ordering in masked diffusions.arXiv preprint arXiv:2502.06768, 2025a

Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham M. Kakade, and Sitan Chen. Train for the worst, plan for the best: Understanding token ordering in masked diffusions. In International Conference on Machine Learning, 2025. doi:10.48550/arXiv.2502.06768. URL https://openreview.net/forum?id=DjJmre5IkP

work page doi:10.48550/arxiv.2502.06768 2025

[16] [16]

Sultan, Andrew McCallum, and Ramón Fernandez

Dhruvesh Patel, Tahira Naseem, Gaurav Pandey, M. Sultan, Andrew McCallum, and Ramón Fernandez. Improved sampling from masked diffusion models with position contrastive guidance. In NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling , 2025. URL https://openreview.net/forum?id=e0WmOrWbtc

work page 2025

[17] [17]

Kakade, and Sitan Chen

Jaeyeon Kim, Jonathan Geuter, David Alvarez-Melis, S. Kakade, and Sitan Chen. Stop training for the worst: Progressive unmasking accelerates masked diffusion training, 2026. URL http://arxiv.org/abs/2602.10314

work page arXiv 2026

[18] [18]

There is no 16-clue sudoku: Solving the sudoku minimum number of clues problem via hitting set enumeration

Gary McGuire, Bastian Tugemann, and Gilles Civario. There is no 16-clue sudoku: Solving the sudoku minimum number of clues problem via hitting set enumeration. Experimental Mathematics, 23 0 (2): 0 190--217, 2012. doi:10.1080/10586458.2013.870056

work page doi:10.1080/10586458.2013.870056 2012

[19] [19]

Hierarchical Reasoning Model

Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Chang-Le Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi-Yadkori. Hierarchical reasoning model. arXiv.org, 2025. doi:10.48550/arXiv.2506.21734

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.21734 2025

[20] [20]

sudoku-solver : a python S udoku solver that traces the human-style strategies it uses

Tim Vink. sudoku-solver : a python S udoku solver that traces the human-style strategies it uses. https://github.com/timvink/sudoku-solver, 2024

work page 2024

[21] [21]

Qwen2.5 Technical Report

Qwen An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, et al. Qwen2.5 technical report, 2024. URL https://arxiv.org/abs/2412.15115

[22]

Bie, T., Cao, M., Chen, K., Du, L., Gong, M., Gong, Z., Gu, Y., Hu, J., Huang, Z., Lan, Z., et al.

A. Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models. arXiv.org, 2025. doi:10.48550/arXiv.2505.00949

[23]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. In Neural Information Processing Systems, pages 21558--21572. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2023. doi:10.52202/075280-0943. URL https://openr...

[24]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. URL http://arxiv.org/abs/2006.11239

[25]

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2020. URL http://arxiv.org/abs/2011.13456

[26]

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models, 2025. URL https://arxiv.org/abs/2508.15487

[27]

Mingyu Jo, Jaesik Yoon, Justin Deschenaux, Caglar Gulcehre, and Sungjin Ahn. Loopholing discrete diffusion: Deterministic bypass of the sampling wall, 2025. URL http://arxiv.org/abs/2510.19304

[28]

Kejing Xia, Mingzhe Li, Lixuan Wei, Zhenbang Du, Xiangchi Yuan, Qirui Jin, and Wenke Lee. MetaState: Persistent working memory for discrete diffusion language models, 2026. URL http://arxiv.org/abs/2603.01331

[29]

Huangjie Zheng, Shansan Gong, Ruixiang Zhang, Tianrong Chen, Jiatao Gu, Mingyuan Zhou, Navdeep Jaitly, and Yizhe Zhang. Continuously augmented discrete diffusion model for categorical generative modeling, 2025. URL http://arxiv.org/abs/2510.01329

[30]

Michael Hersche, Samuel Moor-Smith, Thomas Hofmann, and Abbas Rahimi. Soft-masked diffusion language models, 2025. URL http://arxiv.org/abs/2510.17206

[31]

Tianyu Xie, Shuchen Xue, Zijin Feng, Tianyang Hu, Jiacheng Sun, Zhenguo Li, and Cheng Zhang. Variational autoencoding discrete diffusion with enhanced dimensional correlations modeling, 2025. URL http://arxiv.org/abs/2505.17384

[32]

Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. Generative recursive reasoning models. In International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/pdf?id=Vxu6kcIjwV

[33]

Dhruvesh Patel, Durga Prasad Maram, Sai Sreenivas Chintha, Benjamin Rozonoyer, and Andrew McCallum. xLM: A Python package for non-autoregressive language models. In Danilo Croce, Jochen Leidner, and Nafise Sadat Moosavi, editors, Proceedings of the 19th Conference of the European Chapter of the ACL (Volume 3: System Demonstrations), pages 445--456, Ra...

doi:10.18653/v1/2026.eacl-demo.31