pith. machine review for the scientific record.

arxiv: 2604.02340 · v2 · submitted 2026-02-04 · 💻 cs.LG · cs.CL

Recognition: no theorem link

Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 07:41 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords masked diffusion language models · model scheduling · denoising steps · efficient sampling · FLOPs reduction · generative perplexity · diffusion models

The pith

In masked diffusion language models, early and late denoising steps tolerate a smaller model better than middle steps, cutting FLOPs by up to 17 percent with modest perplexity cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Masked diffusion language models generate text through many full-sequence denoising passes with a large transformer, which is slower than autoregressive decoding. The paper shows that not every step needs the full model size: early and late steps remain robust when a smaller model is substituted, while middle steps are far more sensitive. Scheduling the smaller model for the robust portions of the trajectory reduces computation noticeably while keeping generative perplexity degradation modest and sample diversity intact. This pattern holds for both unconditional and prefix-conditional generation on the tested datasets. The authors locate the sensitive region through loss and KL-divergence analysis across timesteps plus exhaustive checks on coarse step segments.

Core claim

The central claim is that denoising steps in MDLMs are not equally important for model capacity. Early and late steps can be handled by a smaller MDLM with little quality loss, whereas middle steps require the full model, as shown by higher loss and KL divergence when the small model is substituted there. This asymmetry enables simple model scheduling that achieves up to 17 percent FLOP reduction under unconditional and prefix-conditional generation while preserving sample diversity.
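
A back-of-envelope check on that number, assuming per-step compute scales roughly linearly with transformer block count (the figure captions pair a 12-block heavy model with a 4-block light one and 250 of 1000 light steps), recovers the quoted saving:

    # Rough FLOPs-saving estimate under the linear-in-blocks assumption above.
    heavy_blocks, light_blocks = 12, 4
    light_steps, total_steps = 250, 1000
    saving = (light_steps / total_steps) * (1 - light_blocks / heavy_blocks)
    print(f"estimated saving: {saving:.1%}")  # ~16.7%, matching Figure 1's quoted figure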

What carries the argument

Model scheduling that replaces the full MDLM with a smaller one at selected denoising timesteps, with importance ranked by per-step loss and KL divergence between small and large models.
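
A minimal sketch of what that scheduling looks like at sampling time, assuming two pretrained denoisers exposed as interchangeable step functions; sandwich_schedule, sample_with_schedule, and the 125-step edges are illustrative choices drawn from the (L125, H750, L125) schedule in the figures, not the authors' code:

    from typing import Callable

    def sandwich_schedule(step: int, total_steps: int = 1000, edge: int = 125) -> bool:
        """True if the light model should handle this step.

        Mirrors the (L125, H750, L125) schedule from the figures:
        light steps at both ends of the trajectory, heavy steps in the middle.
        """
        return step < edge or step >= total_steps - edge

    def sample_with_schedule(z, heavy_step: Callable, light_step: Callable,
                             total_steps: int = 1000):
        """Run all denoising steps, swapping in the light model where scheduled.

        heavy_step / light_step take (z, step) and return the partially
        denoised sequence passed to the next step.
        """
        for step in range(total_steps):
            fn = light_step if sandwich_schedule(step, total_steps) else heavy_step
            z = fn(z, step)  # one full-sequence denoising pass
        return z

Because the two denoisers share the same interface, no retraining or architectural change is needed, which is the load-bearing premise flagged below.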

If this is right

  • Up to 17 percent fewer FLOPs during sampling without retraining.
  • Only modest rise in generative perplexity for both unconditional and prefix-conditional tasks.
  • Sample diversity remains comparable to the full-model baseline.
  • Middle-step sensitivity appears consistently across OpenWebText and LM1B.
  • Coarse segment-based schedules already deliver most of the savings.
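
The coarse search behind that last bullet is easy to state in code; a sketch assuming 10 equal 100-step segments and an external evaluation that maps each schedule to its generative perplexity (the function names are illustrative, not the authors'):

    from itertools import combinations
    from statistics import mean

    NUM_SEGMENTS = 10  # segment j covers steps 100*j .. 100*(j+1)

    def coarse_schedules(num_light_segments: int):
        """Every assignment of `num_light_segments` of the 10 segments to the light model."""
        for light in combinations(range(NUM_SEGMENTS), num_light_segments):
            yield tuple(j in light for j in range(NUM_SEGMENTS))

    def segment_influence(perplexity_by_schedule: dict) -> list:
        """Mean-subtracted influence of lightening each segment (cf. Figure 6).

        Positive values: replacing that segment with the light model is harmful
        on average; negative values: replacing it is relatively safe.
        """
        overall = mean(perplexity_by_schedule.values())
        return [
            mean(p for s, p in perplexity_by_schedule.items() if s[j]) - overall
            for j in range(NUM_SEGMENTS)
        ]

For a fixed light-segment budget, coarse_schedules enumerates the whole space (Figure 2 reports 210 such configurations), and Figure 6's segment influence corresponds to the mean-subtracted quantity computed above.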

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-middle-late sensitivity pattern might appear in diffusion models for other modalities or architectures.
  • Adaptive per-step model selection could be extended with learned routers instead of fixed segments.
  • The finding suggests that diffusion trajectories contain phases of varying computational sensitivity that could be exploited in non-language settings.

Load-bearing premise

A smaller model can be swapped in at chosen steps without retraining or architectural changes, and the middle-step sensitivity pattern holds beyond the two datasets and model sizes tested.

What would settle it

Measuring a large drop in sample quality or diversity when a smaller model is used in early or late steps on a new dataset, or finding that middle steps are not the most sensitive when the analysis is repeated on a substantially larger model.

Figures

Figures reproduced from arXiv: 2604.02340 by Ivan Sedykh, Nikita Sorokin, Valentin Malykh.

Figure 1
Figure 1: Generative perplexity for model schedules using a heavy 12-block model and a light 4-block model with exactly 250/1000 light steps (16.7% saved FLOPs) on OpenWebText. Each bar label encodes a schedule as contiguous segments, e.g., (L125, H750, L125) denotes the sandwich schedule (125 light steps, 750 heavy steps, 125 light steps), while placing all light steps in the 2nd or 3rd quarter yields the worst per…
Figure 2
Figure 2: Comparison of the top 5 best (left) and worst (right) model scheduling configurations among the 210 coarse schedules. Each row shows one configuration. Red bars indicate light (4-block) model placement. Segments 0–9 correspond to steps 0–100, 100–200, ..., 900–1000, where segment 0 is closest to the fully masked state (t ≈ 1). Best configurations concentrate light segments near both ends, while worst …
Figure 3
Figure 3: Segment frequency in the top 20 best-performing configurations (lowest perplexity). Bars show how often each segment is assigned to the light (4-block) model across the top-20 schedules. Higher frequency suggests that replacing this segment is relatively safe.
Figure 4
Figure 4: Mean absolute difference in masked-token cross-entropy between each light model and the heavy 12-block baseline across timesteps. Each curve compares one light model to the baseline, evaluated on the same corrupted inputs z_t. Lower values indicate higher similarity.
Figure 5
Figure 5: Relative token-level KL divergence (Eq. 6) between model pairs across timesteps, after subtracting a baseline KL curve computed between two independently trained heavy (12-block) checkpoints. Lower values indicate closer agreement. Divergence peaks in the middle of the trajectory, suggesting that intermediate timesteps are most sensitive to model replacement.
Figure 6
Figure 6: Mean-subtracted segment influence from the exhaustive 10-segment search (Section 3.3). For each segment j, we compute the mean perplexity over schedules that assign segment j to the light model, and subtract the mean perplexity over all schedules. Positive values indicate that replacing this segment is harmful on average; negative values indicate that replacing this segment is relatively safe.
Figure 7
Figure 7: Generative perplexity (GPT-2) for hand-crafted model schedules using a heavy 12-block model and a light 6-block model with exactly 250/1000 light steps. Each bar label encodes a schedule as contiguous segments, e.g., (L125, H750, L125) denotes the sandwich schedule (125 light steps, 750 heavy steps, 125 light steps).
Figure 10
Figure 10: Segment frequency in the bottom 20 worst-performing configurations (highest perplexity) from the exhaustive search (Section 3.3). Bars show how often each segment is assigned to the light (4-block) model. Higher frequency suggests that replacing this segment is harmful.
Figure 8
Figure 8: Generative perplexity (GPT-2) for hand-crafted model schedules using a heavy 12-block model and a light 8-block model with exactly 250/1000 light steps. Each bar label encodes a schedule as contiguous segments, e.g., (L125, H750, L125) denotes the sandwich schedule (125 light steps, 750 heavy steps, 125 light steps).
Figure 11
Figure 11: Generative perplexity for hand-crafted model schedules on LM1B (128-token context, 4-block light / 12-block heavy, 250/1000 light steps). The middle-step sensitivity pattern from …
Figure 9
Figure 9: Generative perplexity (GPT-2) for hand-crafted model schedules using a heavy 12-block model and a light 10-block model with exactly 250/1000 light steps. Each bar label encodes a schedule as contiguous segments, e.g., (L125, H750, L125) denotes the sandwich schedule (125 light steps, 750 heavy steps, 125 light steps).
Figure 12
Figure 12: Relative token-level KL divergence (Eq. 6) between model pairs trained on LM1B across timesteps, after subtracting a baseline KL curve computed between two independently trained heavy (12-block) checkpoints. The same middle-trajectory peak observed on OpenWebText …
Figure 13
Figure 13: Conditional generative perplexity for hand-crafted schedules on OpenWebText with 256-token prefixes (4-block light / 12-block heavy, 250/1000 light steps). The schedule ranking matches the unconditional setting …
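
The bar-label notation used throughout these captions, e.g. (L125, H750, L125), maps directly onto a per-step model assignment. A small illustrative helper, assuming segments are listed in trajectory order (the function name and that ordering are assumptions, not the paper's code):

    def expand_schedule(spec: str) -> list:
        """Expand a label like "L125, H750, L125" into a per-step list of
        'L' (light) / 'H' (heavy) assignments, one entry per denoising step."""
        assignment = []
        for segment in spec.split(","):
            segment = segment.strip().strip("()")
            model, count = segment[0], int(segment[1:])
            assignment.extend([model] * count)
        return assignment

    # The sandwich schedule from Figure 1: 125 light, 750 heavy, 125 light steps.
    steps = expand_schedule("L125, H750, L125")
    assert len(steps) == 1000 and steps.count("L") == 250
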
read the original abstract

Recent advances in masked diffusion language models (MDLMs) narrow the quality gap to autoregressive LMs, but their sampling remains expensive because generation requires many full-sequence denoising passes with a large Transformer and, unlike autoregressive decoding, cannot benefit from KV caching. In this work, we exploit the flexibility of the diffusion framework and study model scheduling, where a smaller MDLM replaces the full model at a subset of denoising steps. Across models trained on OpenWebText and LM1B, we show that early and late denoising steps are substantially more robust to such replacement than middle steps, enabling up to a 17% reduction in FLOPs with only modest degradation in generative perplexity under both unconditional and prefix-conditional generation, while preserving sample diversity. We support these findings with a step-importance analysis based on loss and KL divergence between small and large models across timesteps, as well as an exhaustive search over coarse step segments, both of which identify the middle of the diffusion trajectory as most sensitive consistently across datasets. Our results suggest that simple, architecture-agnostic scheduling rules can significantly accelerate MDLM sampling while largely preserving generation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that masked diffusion language models (MDLMs) can be accelerated via model scheduling, in which a smaller MDLM replaces the large model at selected denoising timesteps. Step-importance analysis using per-timestep loss and KL divergence between small and large models, together with exhaustive search over coarse segments, shows that early and late steps are substantially more robust to replacement than middle steps. This enables up to 17% FLOPs reduction with only modest generative-perplexity degradation on OpenWebText and LM1B under both unconditional and prefix-conditional generation, while preserving sample diversity.

Significance. If the empirical findings hold, the work supplies a practical, architecture-agnostic acceleration technique for MDLM sampling that avoids retraining and KV-cache limitations of autoregressive models. The consistent identification of middle-step sensitivity across two datasets and generation modes is a useful diagnostic insight, and the reported preservation of diversity alongside compute savings would be a notable practical contribution to efficient diffusion-based language generation.

major comments (2)
  1. [step-importance analysis and scheduling experiments] The central validation relies on per-step loss and KL divergence to identify robust segments, yet the manuscript does not report any direct measurement of cumulative trajectory inconsistency or error propagation when models are swapped mid-denoising. Because each step conditions the next input distribution, local robustness does not automatically guarantee global stability of the full Markov chain, especially under prefix-conditional generation.
  2. [experimental results] The reported 17% FLOPs reduction and associated perplexity figures lack error bars, multiple random seeds, or statistical significance tests, and the training protocols for the small and large MDLMs are not detailed. These omissions make it difficult to assess whether the observed modest degradation is reliable or reproducible.
minor comments (1)
  1. [abstract and method] The abstract refers to 'exhaustive search over coarse step segments' without specifying the segment granularity or the exact search procedure; the main text should make these choices explicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the strengths and limitations of our analysis. We address each major comment below and indicate the corresponding revisions to the manuscript.

read point-by-point responses
  1. Referee: [step-importance analysis and scheduling experiments] The central validation relies on per-step loss and KL divergence to identify robust segments, yet the manuscript does not report any direct measurement of cumulative trajectory inconsistency or error propagation when models are swapped mid-denoising. Because each step conditions the next input distribution, local robustness does not automatically guarantee global stability of the full Markov chain, especially under prefix-conditional generation.

    Authors: We agree that local per-step metrics alone do not fully capture potential error accumulation in the Markov chain. However, our scheduling results are obtained by executing complete denoising trajectories with the hybrid model schedule (small model substituted only in the identified robust segments) and measuring end-to-end generative perplexity and diversity on both unconditional and prefix-conditional tasks. The exhaustive search over coarse segments therefore evaluates the full sampling process, including any propagation effects that arise from mid-trajectory swaps. To make this explicit, we have added a short paragraph in the revised manuscript noting that the reported metrics reflect complete trajectories rather than isolated steps. revision: partial

  2. Referee: [experimental results] The reported 17% FLOPs reduction and associated perplexity figures lack error bars, multiple random seeds, or statistical significance tests, and the training protocols for the small and large MDLMs are not detailed. These omissions make it difficult to assess whether the observed modest degradation is reliable or reproducible.

    Authors: We acknowledge these omissions reduce reproducibility. In the revised manuscript we now report error bars computed over three independent random seeds for all perplexity and diversity numbers, include a brief note on statistical significance of the observed differences, and expand the experimental setup section with full training details for both model sizes (optimizer, learning-rate schedule, batch size, number of training steps, and data preprocessing). revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on direct per-step empirical measurements

full rationale

The paper identifies sensitive timesteps via explicit computation of loss and KL divergence between small and large MDLMs at each t, followed by exhaustive enumeration of coarse segments for scheduling. These quantities are computed directly from the models' forward passes and are not defined in terms of the final scheduling rule or generative perplexity. No derivation reduces a prediction to a fitted parameter by construction, and no self-citation supplies a uniqueness theorem or ansatz that forces the result. The central claim therefore remains an empirical observation rather than a tautology.
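
A hedged sketch of the kind of per-timestep comparison this rationale describes, assuming both models expose logits over the vocabulary for the same corrupted input z_t and a boolean mask of the positions that are masked at that timestep (torch-based; the function and argument names are illustrative):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def per_step_disagreement(heavy_logits: torch.Tensor,
                              light_logits: torch.Tensor,
                              masked: torch.Tensor,
                              targets: torch.Tensor):
        """Compare heavy and light predictions on the same corrupted input z_t.

        heavy_logits, light_logits: (batch, seq, vocab) logits at one timestep.
        masked: (batch, seq) bool, True where the token is masked at this timestep.
        targets: (batch, seq) ground-truth token ids.
        Returns (cross-entropy gap, token-level KL) averaged over masked positions,
        in the spirit of the quantities plotted in Figures 4 and 5.
        """
        logp_h = F.log_softmax(heavy_logits, dim=-1)
        logp_l = F.log_softmax(light_logits, dim=-1)

        ce_h = F.nll_loss(logp_h[masked], targets[masked], reduction="mean")
        ce_l = F.nll_loss(logp_l[masked], targets[masked], reduction="mean")

        # KL(heavy || light) per masked token, then averaged over masked positions.
        kl = (logp_h.exp() * (logp_h - logp_l)).sum(-1)[masked].mean()
        return (ce_l - ce_h).abs(), kl

Figure 5 additionally subtracts a baseline KL curve measured between two independently trained heavy checkpoints; that subtraction is omitted here for brevity.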

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work adds no new free parameters beyond empirical schedule choice, relies on standard diffusion process assumptions, and introduces no invented entities.

free parameters (1)
  • replacement schedule segments
    Specific early/late segments chosen via analysis and exhaustive search on the datasets.
axioms (1)
  • domain assumption
    The masked diffusion process permits independent model evaluation at individual timesteps without violating the overall generative distribution.
    Invoked to justify swapping models at selected steps.

pith-pipeline@v0.9.0 · 5507 in / 1256 out tokens · 50655 ms · 2026-05-16T07:41:43.846650+00:00 · methodology

