Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

Kecheng Chen; Ziru Liu; Xijia Tao; Hui Liu; Yibing Liu; Xinyu Fu; Shi Wu; Suiyun Zhang; Dandan Tu; Lingpeng Kong; Rui Liu; Haoliang Li

arxiv: 2605.11854 · v2 · pith:NFR7D3TKnew · submitted 2026-05-12 · 💻 cs.CL


Pith reviewed 2026-05-20 22:55 UTC · model grok-4.3

classification 💻 cs.CL

keywords: diffusion language models, self-distillation, trajectory-aware training, Boltzmann modeling, training-inference discrepancy, pairwise ranking objective, catastrophic forgetting, post-training

The pith

TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies to derive a pairwise ranking objective that aligns DLM training with the easy-to-hard inference trajectory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion language models suffer from a mismatch where training reconstructs randomly masked tokens in one step while inference follows a multi-step confidence-guided path. Standard supervised fine-tuning fails to exploit this structure, and prior self-distillation efforts mainly speed up sampling without deepening capability. TABOM instead treats the observed unmasking order as a Boltzmann distribution over entropy values and converts it into a tractable ranking loss that forces the model to match the certainty ordering seen during decoding. A sympathetic reader would care because the method operates entirely on the model's own generated trajectories, lowering the optimization barrier while targeting genuine knowledge acquisition rather than acceleration alone.
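
To make the mismatch concrete, here is a minimal Python sketch under toy assumptions. The names `random_mask`, `easy_to_hard_order`, and `toy_probs` are illustrative and do not come from the paper. Entropies are computed once here, whereas a real DLM decoder would re-run the model after every unmasking step.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def random_mask(num_positions, mask_ratio, rng):
    """Training-style corruption: choose masked positions uniformly at random."""
    k = round(mask_ratio * num_positions)
    return sorted(rng.sample(range(num_positions), k))

def easy_to_hard_order(position_probs):
    """Inference-style order: lowest-entropy (most certain) position first.

    Simplification (assumption): entropies are not recomputed between steps.
    """
    entropies = [entropy(p) for p in position_probs]
    return sorted(range(len(position_probs)), key=lambda i: entropies[i])

# Hypothetical per-position predictive distributions over a 3-token vocabulary.
toy_probs = [
    [0.34, 0.33, 0.33],  # position 0: nearly uniform, hard
    [0.98, 0.01, 0.01],  # position 1: confident, easy
    [0.70, 0.20, 0.10],  # position 2: in between
]

order = easy_to_hard_order(toy_probs)
masked = random_mask(len(toy_probs), mask_ratio=0.67, rng=random.Random(0))
print("easy-to-hard unmasking order:", order)
print("randomly masked positions:", masked)
```

The random mask ignores certainty entirely, while the inference order is fully determined by it. That structural gap is what the ranking loss is meant to close.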

Core claim

The central claim is that modeling the inference unmasking preference as a Boltzmann distribution over predictive entropies produces a pairwise ranking objective whose optimization aligns the model's certainty ordering with the observed decoding trajectory, thereby bridging the training-inference discrepancy and enabling self-distilled trajectories to deliver genuine capability gains instead of mere efficiency improvements.

What carries the argument

Trajectory-Aligned optimization via Boltzmann Modeling (TABOM), which converts inference unmasking sequences into a Boltzmann distribution over predictive entropies and derives a pairwise ranking loss from it.
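
As a hedged illustration of how a Boltzmann preference can turn into a pairwise loss, consider the simplest case of choosing which of two masked positions to reveal first. The paper's exact objective is not reproduced on this page, so the form below is an assumption: a two-way Boltzmann choice over predictive entropies $H_i$ and $H_j$ with inverse temperature $\beta > 0$. The normalizer cancels, which is what makes the pairwise form tractable.

```latex
% Assumption: preference restricted to two candidate positions i and j.
\begin{align}
P(i \succ j)
  &= \frac{\exp(-\beta H_i)}{\exp(-\beta H_i) + \exp(-\beta H_j)}
   = \sigma\big(\beta\,(H_j - H_i)\big), \\
\mathcal{L}_{\text{pair}}(i, j)
  &= -\log \sigma\big(\beta\,(H_j - H_i)\big).
\end{align}
```

Here $\sigma$ is the logistic function and $i \succ j$ means position $i$ was unmasked before position $j$ in the observed trajectory. The loss is small when the model is already more certain about $i$ than about $j$, and grows when the ordering is inverted.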

If this is right

Substantial performance gains appear in new domains after TABOM post-training.
The effective knowledge boundary of diffusion language models expands beyond what standard supervised fine-tuning reaches.
Catastrophic forgetting is significantly reduced compared with NELBO-based SFT on the same trajectories.
Self-distilled trajectories become a source of genuine capability improvement rather than only sampling-step compression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Boltzmann-ranking construction could be applied to other non-autoregressive generators that exhibit easy-to-hard generation orders.
Combining TABOM with existing acceleration methods might simultaneously improve both capability and speed.
The approach suggests that any generative model whose training distribution differs from its inference path could benefit from explicit trajectory alignment losses.

Load-bearing premise

Optimizing the derived pairwise ranking objective on self-generated trajectories genuinely improves the model's underlying capability rather than simply memorizing the observed inference path.

What would settle it

Train a DLM with TABOM on new-domain trajectories, then measure performance under full diffusion decoding on held-out tasks. If the gains over standard NELBO SFT disappear, or forgetting increases, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.11854 by Dandan Tu, Haoliang Li, Hui Liu, Kecheng Chen, Lingpeng Kong, Rui Liu, Shi Wu, Suiyun Zhang, Xijia Tao, Xinyu Fu, Yibing Liu, Ziru Liu.

**Figure 2.** Trajectory Discrimination Score during decoding on Dream. We compute the variance of ...

**Figure 3.** Comparison of cross-entropy loss between GT and SD data across different mask ratios.

Original abstract

Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard Negative Evidence Lower Bound (NELBO)-based supervised fine-tuning remains inefficient: training reconstructs randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step easy-to-hard denoising trajectory. Recent trajectory-based self-distillation methods exploit such inference trajectories mainly for sampling-step compression and acceleration, often improving decoding efficiency without substantially enhancing the model's underlying capability, and may even degrade performance under full diffusion decoding. In this work, we ask whether self-distilled trajectories can be used not merely for faster inference, but for genuine knowledge acquisition. Although these trajectories lie on the pretrained DLM's own distributional manifold and thus offer a potentially lower optimization barrier, we find that naively fine-tuning on them with standard NELBO objectives yields only marginal gains. To address this limitation, we propose Trajectory-Aligned optimization via Boltzmann Modeling (TABOM), a self-distilled trajectory-based post-training framework that aligns training with the easy-to-hard structure of inference. TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective to align the model's certainty ordering with the observed decoding trajectory. Empirically, TABOM achieves substantial gains in new domains, expands the effective knowledge boundary of DLMs, and significantly mitigates catastrophic forgetting compared with standard SFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TABOM turns self-distilled trajectories into a Boltzmann-based pairwise ranking loss to better match DLM training to inference, and the reported gains over standard SFT on new domains and forgetting look worth checking.

The main takeaway is that this work takes the easy-to-hard unmasking paths the model already produces at inference and uses them to train a ranking objective instead of plain NELBO. They model the preference for lower-entropy tokens as a Boltzmann distribution and derive a pairwise loss from that, which they claim closes the train-inference gap more effectively than just replaying the trajectories with the usual objective.
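
The letter's description can be sketched in a few lines of Python. This is an illustration, not the authors' implementation: it assumes each pairwise term takes the logistic form -log σ(β(H_later − H_earlier)), which is what a two-way Boltzmann choice over entropies yields, and it operates on plain floats rather than differentiable tensors. The function name `pairwise_ranking_loss` and the `beta` default are hypothetical.

```python
import math

def pairwise_ranking_loss(entropies, unmask_order, beta=1.0):
    """Mean of -log sigmoid(beta * (H_later - H_earlier)) over ordered pairs.

    entropies[i] is the model's predictive entropy at position i;
    unmask_order lists positions from first unmasked to last unmasked.
    """
    total, count = 0.0, 0
    for a in range(len(unmask_order)):
        for b in range(a + 1, len(unmask_order)):
            earlier, later = unmask_order[a], unmask_order[b]
            margin = beta * (entropies[later] - entropies[earlier])
            # softplus(-margin) equals -log(sigmoid(margin)); stable form.
            total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
            count += 1
    return total / count if count else 0.0

# Hypothetical entropies (nats) for three positions.
entropies = [1.09, 0.11, 0.80]
aligned = pairwise_ranking_loss(entropies, [1, 2, 0])   # low entropy first
inverted = pairwise_ranking_loss(entropies, [0, 2, 1])  # high entropy first
print(aligned, inverted)
```

When the entropy ordering agrees with the trajectory, every pair contributes less than log 2; when it is inverted, every pair contributes more. In training, the entropies would come from the model being optimized, so the gradient pushes its certainty ordering toward the observed path.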

Referee Report

2 major / 2 minor

Summary. The paper proposes TABOM, a self-distilled post-training framework for diffusion language models that addresses the training-inference discrepancy by modeling inference unmasking preferences as a Boltzmann distribution over predictive entropies along observed trajectories. From this, it derives a tractable pairwise ranking objective to align the model's certainty ordering with the easy-to-hard denoising path, claiming this yields genuine capability improvements (new-domain gains, expanded knowledge boundaries, reduced catastrophic forgetting) unlike standard NELBO fine-tuning on the same trajectories, which only produces marginal gains.

Significance. If the central empirical claims hold under rigorous controls, the work could meaningfully advance post-training for DLMs by turning self-generated trajectories into an effective signal for capability expansion rather than mere acceleration or regularization. The Boltzmann-derived ranking objective, if shown to escape the pretrained manifold in a principled way, would represent a targeted solution to a known mismatch in diffusion-based generation.

major comments (2)

[Method] Method section (derivation of pairwise ranking objective): The manuscript states that standard NELBO on self-distilled trajectories yields only marginal gains while the entropy-based Boltzmann ranking produces substantial improvements, but provides no explicit derivation or analysis showing why the ranking loss escapes the regime of fitting the pretrained manifold (as opposed to implicit regularization or reduced train-inference mismatch). Since trajectories are generated from the model itself, a concrete argument or ablation demonstrating that optimization acquires new knowledge rather than reweighting existing predictions is needed to support the central claim.
[Experiments] Experiments section (new-domain and forgetting results): The abstract reports substantial gains in new domains and mitigation of catastrophic forgetting, yet the provided details do not include controls such as comparison against trajectories from a stronger external teacher model or explicit knowledge-injection baselines. Without these, it remains possible that observed improvements stem from better alignment to the model's own inference dynamics rather than expanded capability, weakening the knowledge-boundary claim.

minor comments (2)

[Introduction] The abstract and introduction use 'expands the effective knowledge boundary' without a precise operational definition or metric; a formal definition tied to the evaluation protocol would improve clarity.
[Method] Notation for the Boltzmann temperature or scaling parameter should be introduced explicitly in the method section and its sensitivity analyzed, as it appears to be a free parameter in the modeling assumption.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which have helped us strengthen the presentation of TABOM's contributions. We address each major comment point by point below and have revised the manuscript accordingly where the suggestions improve clarity or rigor.

Referee: [Method] Method section (derivation of pairwise ranking objective): The manuscript states that standard NELBO on self-distilled trajectories yields only marginal gains while the entropy-based Boltzmann ranking produces substantial improvements, but provides no explicit derivation or analysis showing why the ranking loss escapes the regime of fitting the pretrained manifold (as opposed to implicit regularization or reduced train-inference mismatch). Since trajectories are generated from the model itself, a concrete argument or ablation demonstrating that optimization acquires new knowledge rather than reweighting existing predictions is needed to support the central claim.

Authors: We thank the referee for this important observation. In the revised manuscript we have expanded the derivation in Section 3.2 to explicitly show how the pairwise ranking loss, obtained by taking the log-ratio of the Boltzmann probabilities over predictive entropies along the observed trajectory, optimizes relative ordering rather than absolute reconstruction likelihood. This objective directly penalizes inversions in the model's certainty hierarchy that would violate the easy-to-hard unmasking path, thereby encouraging the model to reinforce and extend its entropy-based preferences beyond what standard NELBO achieves on the same data. To distinguish this from simple reweighting or regularization, we have added an ablation that applies the identical ranking loss to randomly shuffled trajectories; the resulting performance degradation indicates that the specific ordering present in the self-generated paths is essential. While we cannot claim to have introduced entirely novel external facts, the consistent gains on new-domain tasks and reduced forgetting suggest that the optimization moves the model outside the narrow regime of its original pretraining manifold.
Revision: yes
Referee: [Experiments] Experiments section (new-domain and forgetting results): The abstract reports substantial gains in new domains and mitigation of catastrophic forgetting, yet the provided details do not include controls such as comparison against trajectories from a stronger external teacher model or explicit knowledge-injection baselines. Without these, it remains possible that observed improvements stem from better alignment to the model's own inference dynamics rather than expanded capability, weakening the knowledge-boundary claim.

Authors: We agree that additional controls would further substantiate the knowledge-expansion claim. In the revised experiments section we have added a knowledge-injection baseline that fine-tunes on externally curated domain-specific data using standard NELBO, allowing direct comparison with our self-distilled setting. We also include a brief discussion explaining why an external-teacher comparison would change the experimental paradigm from self-distillation to cross-model distillation; our focus is precisely on what can be achieved from the model's own trajectories. The updated results show that TABOM still outperforms both the external-injection baseline and NELBO on the same self-trajectories in new-domain accuracy and forgetting metrics, supporting that the Boltzmann ranking objective itself contributes to capability expansion rather than mere alignment.
Revision: yes
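
The shuffled-trajectory ablation mentioned in the first response can be illustrated with a small sketch of what shuffling destroys. This is not the authors' protocol, which retrains the model and measures downstream performance. It only measures, on synthetic data, how often an unmasking order agrees with an entropy ranking; `concordance` and all values are hypothetical.

```python
import random

def concordance(entropies, unmask_order):
    """Fraction of position pairs where the earlier-unmasked one has lower entropy."""
    agree, total = 0, 0
    for a in range(len(unmask_order)):
        for b in range(a + 1, len(unmask_order)):
            total += 1
            if entropies[unmask_order[a]] < entropies[unmask_order[b]]:
                agree += 1
    return agree / total if total else 1.0

rng = random.Random(0)
num_positions = 32
# Synthetic per-position entropies (assumption: no model involved).
entropies = [rng.random() for _ in range(num_positions)]
# Idealized easy-to-hard trajectory: strictly increasing entropy.
observed = sorted(range(num_positions), key=lambda i: entropies[i])
# Ablation-style control: same positions, ordering information removed.
shuffled = observed[:]
rng.shuffle(shuffled)

observed_score = concordance(entropies, observed)
shuffled_score = concordance(entropies, shuffled)
print(observed_score, shuffled_score)
```

A shuffled order carries roughly chance-level agreement with the entropy ranking, so a ranking loss trained on it would receive no consistent ordering signal. That is the intuition behind using it as a control.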

Circularity Check

0 steps flagged

Boltzmann modeling is a standard ansatz with independent derivation; no reduction to inputs by construction

The paper introduces TABOM by positing that inference unmasking preference follows a Boltzmann distribution over predictive entropies, from which a pairwise ranking objective is mathematically derived to align certainty ordering with observed trajectories. This constitutes a modeling assumption followed by a standard derivation of a tractable loss, not a self-definitional loop or a fitted parameter relabeled as a prediction. No load-bearing self-citations, uniqueness theorems from prior author work, or ansatz smuggling via citation appear in the abstract or described chain. The trajectories are used as data for optimization, but the objective itself does not reduce to those inputs by construction; the paper explicitly contrasts it with NELBO on the same data yielding only marginal gains. The central claim of genuine capability expansion is an empirical assertion open to external validation rather than a circular derivation. This is the most common honest outcome for papers that add a new loss formulation without collapsing the argument to its own fitted values.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that inference trajectories provide a lower optimization barrier and that entropy-based Boltzmann modeling captures unmasking preferences in a way that transfers to improved model capability.

free parameters (1)

Boltzmann temperature or scaling parameter
Likely required to shape the distribution over predictive entropies; its value would need to be chosen or fitted for the ranking objective to be effective.

axioms (1)

Domain assumption: Inference trajectories lie on the pretrained DLM's own distributional manifold and therefore offer a lower optimization barrier than random masking.
Explicitly stated in the abstract as the reason self-distilled trajectories are promising for knowledge acquisition.
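
To show why the temperature in the ledger matters, the following sketch computes normalized Boltzmann weights over a few hypothetical entropies at different inverse temperatures. The function name `boltzmann_weights`, the entropy values, and the chosen `beta` grid are assumptions for illustration only.

```python
import math

def boltzmann_weights(entropies, beta):
    """Normalized exp(-beta * H) weights over candidate positions."""
    shift = min(entropies)  # subtracting the minimum keeps exp() in range
    scores = [math.exp(-beta * (h - shift)) for h in entropies]
    z = sum(scores)
    return [s / z for s in scores]

entropies = [0.1, 0.8, 1.1]  # hypothetical predictive entropies (nats)
for beta in (0.0, 1.0, 10.0):
    weights = boltzmann_weights(entropies, beta)
    print(beta, [round(w, 3) for w in weights])
```

At `beta = 0` the preference is uniform and carries no ordering signal. As `beta` grows, the weights concentrate on the lowest-entropy position and approach a greedy rule, which is why the referee's request for a sensitivity analysis is reasonable.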

pith-pipeline@v0.9.0 · 5869 in / 1387 out tokens · 45578 ms · 2026-05-20T22:55:59.330682+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: $q^{\star}_{\mathrm{infer}}(U \mid x_0) = \frac{1}{Z} \exp\big(-\beta \sum_{r \in U} H_{\mathrm{ideal}}(x_0^{r})\big)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...
A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.

A. Sensitivity to λ and γ

We perform a 2×3 sensitivity analysis over λ ∈ {1, 2} and γ ∈ {0.1, 0.2, 0.3}, where γ denotes the ma...

[1] [1]

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Ha...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

OpenAI o3 and o4-mini system card.https://openai.com/index/o3-o4-mini-system- card/, 2025

OpenAI. OpenAI o3 and o4-mini system card.https://openai.com/index/o3-o4-mini-system- card/, 2025

work page 2025

[3] [3]

Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

work page

[4] [4]

URLhttps://arxiv.org/abs/2412.19437

work page internal anchor Pith review Pith/arXiv arXiv

[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762

[6] Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025.

[7] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025.

[8] Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie, Stefano Ermon, Yi Wu, and Chongxuan Li. Principled RL for diffusion LLMs emerges from a sequence-level perspective, 2025. URL https://arxiv.org/abs/2512.03759

[9] Jingyi Yang, Yuxian Jiang, Xuhao Hu, Shuang Cheng, Biqing Qi, and Jing Shao. DARE: Diffusion large language models alignment and reinforcement executor, 2026. URL https://arxiv.org/abs/2604.04215

[11] Jingyi Yang, Guanxu Chen, Xuhao Hu, and Jing Shao. Taming masked diffusion language models via consistency trajectory reinforcement learning with fewer decoding step, 2025. URL https://arxiv.org/abs/2509.23924

[12] Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, and Da Zheng. dinfer: An efficient inference framework for diffusion language models, 2025. URL htt...

[13] Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, and Hao Zhou. Seed diffusion: A large-scale diffusion language model with high-speed inference, 2025. URL h...

[14] Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Vladimir Pavlovic, et al. T3d: Few-step diffusion language models via trajectory self-distillation with direct discriminative optimization. arXiv preprint arXiv:2602.12262, 2026.

[15] inclusionAI. Ling-Coder-SFT. https://huggingface.co/datasets/inclusionAI/Ling-Coder-SFT, 2024.

[16] horseee. MixChain-Z-PRM12K. https://huggingface.co/datasets/horseee/MixChain-Z-PRM12K, 2024.

[17] Gen Li and Changxiao Cai. A convergence theory for diffusion language models: An information-theoretic perspective, 2025. URL https://arxiv.org/abs/2505.21400

[18] Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, and Volodymyr Kuleshov. Mercury: Ultra-fast language models based on diffusion, 2025. URL https://arxiv.org/abs/2506.17298

[19] Fengfei Sun, Ningke Li, Kailong Wang, and Lorenz Goette. Large language models are overconfident and amplify human bias, 2025. URL https://arxiv.org/abs/2505.02151

[20] Yann LeCun, Sumit Chopra, and Raia Hadsell. A tutorial on energy-based learning. 01 2006.

[21] Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216, 2025.

[22] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

[23] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[24] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.

[25] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[26] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information P...

[27] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.

[28] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.

[29] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454–12465, 2021.

[30] Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Proce...

[31] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion language modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.

[32] Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K Titsias. Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329, 2024.

[33] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.

[34] Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024.

[35] Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736, 2024.

[36] Google DeepMind. Gemini-diffusion, 2025. URL https://blog.google/technology/google-deepmind/gemini-diffusion/

[37] Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dLLM: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding. CoRR, abs/2505.22618, 2025.

A Sensitivity to λ and γ

We perform a 2×3 sensitivity analysis over λ ∈ {1, 2} and γ ∈ {0.1, 0.2, 0.3}, where γ denotes the ma...