pith. sign in

arxiv: 2507.08390 · v4 · submitted 2025-07-11 · 💻 cs.LG

Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement

Pith reviewed 2026-05-19 04:58 UTC · model grok-4.3

classification 💻 cs.LG
keywords diffusion language modelsinference-time scalingparticle Gibbs samplingtrajectory refinementsequential Monte Carloreward-guided generationGSM8K
0
0 comments X

The pith

Particle Gibbs sampling refines full denoising trajectories in diffusion language models to introduce a new scaling axis of refinement iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Discrete diffusion language models match autoregressive performance after large training but lack strong inference-time methods to steer outputs toward rewards without retraining. Earlier techniques resample or filter inside one denoising path step by step. PG-DLM instead builds a Markov chain across complete trajectories and uses a conditional sequential Monte Carlo kernel to resample them. This creates an extra compute dimension—the count of refinement iterations—that continues to raise accuracy after adding more parallel particles stops helping. The method also supports adaptive compute by running extra iterations only when needed and supplies convergence and variance guarantees.

Core claim

PG-DLM constructs a Markov chain over full denoising trajectories and applies a conditional sequential Monte Carlo kernel to resample them, introducing a new scaling axis (number of refinement iterations) that remains effective even as gains from adding more parallel samples saturate.

What carries the argument

Markov chain over full denoising trajectories with a conditional sequential Monte Carlo kernel for resampling.

If this is right

  • Reward-guided generation accuracy rises with additional refinement iterations after parallel sampling saturates.
  • Adaptive compute allocation becomes possible by performing extra iterations only on difficult samples.
  • Theoretical convergence and variance bounds hold for the constructed Markov chain.
  • Empirical outperformance occurs across varying compute budgets on tasks such as GSM8K math problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The trajectory-level view may extend to other discrete generative processes where step-wise resampling is currently used.
  • Future inference scaling laws could treat refinement iterations as an orthogonal axis to model size or sample count.
  • Combining trajectory refinement with light fine-tuning might produce larger gains than either alone.

Load-bearing premise

The conditional sequential Monte Carlo kernel can be implemented efficiently and the resulting Markov chain mixes sufficiently fast to deliver measurable gains within a small number of refinement iterations.

What would settle it

If accuracy on GSM8K stops improving when refinement iterations increase from a few to ten while particle count stays fixed, the claim of an effective new scaling axis would be contradicted.

Figures

Figures reproduced from arXiv: 2507.08390 by Akash Srivastava, Jiaqi Han, Kai Xu, Meihua Dang, Minkai Xu, Stefano Ermon.

Figure 1
Figure 1. Figure 1: Illustration of PG-DLM. At each iteration, a reference trajectory is fixed (top row), new [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Trade-off between particle Gibbs iterations m and sample counts k across compute budgets (NFEs). The x-axis shows NFEs controlled by varying k, and the legend shows m. Increasing k (with m= 1) performs best in low-NFE regimes. However, as samples saturate, additional iterations (m= 2, 4) become more effective. m k Toxicity 1 32 90.3 2 16 93.6 4 8 91.7 1 64 96.3 2 32 97.0 4 16 97.6 [PITH_FULL_IMAGE:figures… view at source ↗
Figure 3
Figure 3. Figure 3 [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Trade-offs between sample counts k and denoising steps T across compute budgets (NFEs). For (a) LLaDA, the x-axis shows NFEs controlled by varying k, with T in the legend; for (b-d) MDLM, the x-axis shows NFEs controlled by varying T, with k in the legend. Scaling k (and decreasing T accordingly) generally yields better performance under the same NFEs. Denoising Steps vs. Sample Count. In masked diffusion … view at source ↗
Figure 5
Figure 5. Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of ReMDM and vanilla MDLM backward processes under varying compute budgets (NFEs). The x-axis shows NFEs, controlled by varying the number of samples k, while the legend shows denoising steps T ∈ {128, 256, 512}. ReMDM consistently achieves higher accuracies, demonstrating the effectiveness of improved backward transition dynamics. 6 RELATED WORK Inference-time scaling has been extensively studi… view at source ↗
Figure 7
Figure 7. Figure 7: Trade-offs between sample counts k and denoising steps T across compute budgets (NFEs) for SMC (ϕ = 1). The x-axis shows NFEs controlled by varying T, with k in the legend. Scaling k (and decreasing T accordingly) generally yields better performance under the same NFEs. 2. For SMC with number of x0 samples ϕ = 4: 10 4 10 5 Inference Compute (NFE) 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy Toxicity k=2 k=4 k=8 k=16 k… view at source ↗
Figure 8
Figure 8. Figure 8: Trade-offs between sample counts k and denoising steps T across compute budgets (NFEs) for SMC (ϕ = 4). The x-axis shows NFEs controlled by varying T, with k in the legend. Scaling k (and decreasing T accordingly) generally yields better performance under the same NFEs. 3. For BON: 10 3 10 4 Inference Compute (NFE) 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy Toxicity k=2 k=4 k=8 k=16 k=32 10 3 10 4 Inference Compute … view at source ↗
Figure 9
Figure 9. Figure 9: Trade-offs between sample counts k and denoising steps T across compute budgets (NFEs) for BON. The x-axis shows NFEs controlled by varying T, with k in the legend. Scaling k (and decreasing T accordingly) generally yields better performance under the same NFEs. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of Beam and Random sampling for partial reward estimation with varying [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
read the original abstract

Discrete diffusion models have recently emerged as strong alternatives to autoregressive language models, matching their performance through large-scale training. However, inference-time control remains relatively underexplored. In this work, we study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), an inference-time algorithm enabling trajectory-level refinement. PG-DLM constructs a Markov chain over full denoising trajectories and applies a conditional sequential Monte Carlo kernel to resample them. By doing so, PG-DLM introduces a new scaling axis, the number of refinement iterations, which is unavailable to prior methods. Increasing iterations remains effective even as gains from adding more parallel samples saturate. Furthermore, PG-DLM enables adaptive compute allocation by performing additional iterations only when needed, leading to further efficiency gains. We derive theoretical guarantees for convergence and variance bounds, and analyze trade-offs across different scaling axes. Empirically, PG-DLM outperforms prior methods across compute budgets on reward-guided generation tasks. On GSM8K, it achieves 90.07% accuracy with 2.9 particles on average and 94.47% accuracy with 16 particles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PG-DLM, a particle Gibbs sampling method for diffusion language models. It constructs a Markov chain over full denoising trajectories and applies a conditional sequential Monte Carlo kernel to enable trajectory-level refinement. This adds the number of refinement iterations as a new inference-time scaling axis that remains effective after parallel sampling saturates. Theoretical convergence and variance bounds are derived from SMC/MCMC theory, and experiments show outperformance on reward-guided tasks, including 90.07% accuracy on GSM8K with 2.9 particles on average and 94.47% with 16 particles.

Significance. If the conditional SMC kernel mixes sufficiently fast and can be implemented efficiently, the work provides a new, adaptive compute axis for steering discrete diffusion LMs at inference time without retraining. The theoretical guarantees and concrete GSM8K results strengthen the case for trajectory-level methods over step-wise resampling, with potential efficiency gains from adaptive iteration allocation.

major comments (2)
  1. [Abstract and theoretical section] Abstract and theoretical section: the convergence and variance bounds rely on standard SMC theory applied to the diffusion process, yet no mixing-time analysis, autocorrelation plots, or effective sample size curves versus refinement iteration count are provided to substantiate that the conditional SMC kernel mixes rapidly enough in the discrete high-dimensional trajectory space for the claimed gains with small iteration counts.
  2. [Empirical evaluation on GSM8K] Empirical evaluation on GSM8K: the reported accuracies (90.07% with average 2.9 particles, 94.47% with 16 particles) are presented without full specification of the precise SMC kernel form or any post-hoc tuning of particle counts and iteration schedules, leaving open the possibility that performance depends on implementation choices not visible in the manuscript.
minor comments (2)
  1. [Methods section] Methods section: provide pseudocode or a clear algorithmic description of the conditional SMC kernel and the overall PG-DLM procedure to improve reproducibility.
  2. [Figures comparing scaling axes] Figures comparing scaling axes: include error bars or multiple random seeds to allow assessment of variability in the trade-off between refinement iterations and parallel samples.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We address each major comment below and indicate the revisions we plan to incorporate.

read point-by-point responses
  1. Referee: [Abstract and theoretical section] Abstract and theoretical section: the convergence and variance bounds rely on standard SMC theory applied to the diffusion process, yet no mixing-time analysis, autocorrelation plots, or effective sample size curves versus refinement iteration count are provided to substantiate that the conditional SMC kernel mixes rapidly enough in the discrete high-dimensional trajectory space for the claimed gains with small iteration counts.

    Authors: We acknowledge that the manuscript presents convergence and variance bounds derived from standard SMC/MCMC theory without accompanying empirical diagnostics such as mixing-time analysis or effective sample size curves. While the theoretical results guarantee asymptotic correctness independent of mixing rate, we agree that empirical evidence would better substantiate the practical gains observed with small iteration counts. In the revised manuscript we will add effective sample size curves and autocorrelation plots versus refinement iteration count for the GSM8K experiments to illustrate the mixing behavior of the conditional SMC kernel. revision: yes

  2. Referee: [Empirical evaluation on GSM8K] Empirical evaluation on GSM8K: the reported accuracies (90.07% with average 2.9 particles, 94.47% with 16 particles) are presented without full specification of the precise SMC kernel form or any post-hoc tuning of particle counts and iteration schedules, leaving open the possibility that performance depends on implementation choices not visible in the manuscript.

    Authors: We thank the referee for highlighting the need for greater implementation transparency. The manuscript currently describes the PG-DLM procedure at the algorithmic level. To improve reproducibility, the revised version will provide the exact mathematical form of the conditional SMC kernel, pseudocode for the resampling step, and the precise particle counts together with the iteration schedules employed in the GSM8K experiments. We confirm that the reported accuracies correspond to these configurations without additional undisclosed tuning. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard SMC theory independently

full rationale

The paper introduces PG-DLM by constructing a Markov chain over full denoising trajectories and applying a conditional sequential Monte Carlo kernel, deriving convergence and variance bounds from established SMC/MCMC theory rather than internal fits or self-citations. No equations reduce reported performance or scaling claims to quantities defined or fitted inside the paper itself. The new scaling axis (refinement iterations) is presented as a direct consequence of the trajectory-level resampling, with empirical results on tasks like GSM8K treated as external validation. The derivation remains self-contained against external benchmarks such as standard SMC convergence results.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The method rests on standard diffusion Markov chain assumptions and SMC convergence theory; the main free parameters are the number of particles and refinement iterations, which are varied experimentally rather than derived.

free parameters (2)
  • number of particles
    Controls parallel diversity and is reported as 2.9 on average for 90% GSM8K accuracy; chosen to balance compute and performance.
  • number of refinement iterations
    New scaling axis; increased until gains saturate or adaptive stopping is triggered.
axioms (1)
  • domain assumption The discrete diffusion process induces a valid Markov chain on full trajectories that admits a conditional SMC kernel.
    Invoked when constructing the Markov chain over denoising trajectories.

pith-pipeline@v0.9.0 · 5772 in / 1344 out tokens · 42158 ms · 2026-05-19T04:58:22.773215+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.

  2. Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast

    cs.CL 2026-05 unverdicted novelty 7.0

    FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.

  3. Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution

    cs.CV 2025-12 unverdicted novelty 7.0

    IAFS is a training-free iterative inference-time scaling framework that uses adaptive frequency-aware particle fusion to resolve the perception-fidelity conflict in diffusion super-resolution models, outperforming pri...

  4. ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

    cs.LG 2026-01 conditional novelty 6.0

    ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.

  5. ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

    cs.LG 2026-01 unverdicted novelty 6.0

    ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 4 Pith papers · 7 internal anchors

  1. [1]

    Particle markov chain monte carlo methods

    Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle markov chain monte carlo methods. Journal of the Royal Statistical Society Series B: Statistical Methodology, 72 0 (3): 0 269--342, 2010

  2. [2]

    Structured denoising diffusion models in discrete state-spaces

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. In Advances in neural information processing systems, volume 34, pages 17981--17993, 2021

  3. [3]

    Tweeteval: Unified benchmark and comparative evaluation for tweet classification

    Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. Tweeteval: Unified benchmark and comparative evaluation for tweet classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1644--1650, 2020

  4. [4]

    Scaling test-time compute with open models

    Edward Beeching, Lewis Tunstall, and Sasha Rush. Scaling test-time compute with open models. URL: https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute, 2024

  5. [5]

    Training diffusion models with reinforcement learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. In International Conference on Learning Representations, 2024

  6. [6]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  7. [7]

    Directly Fine-Tuning Diffusion Models on Differentiable Rewards

    Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023

  8. [8]

    Plug and play language models: A simple approach to controlled text generation

    Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations, 2020

  9. [9]

    Diffusion posterior sampling for linear inverse problem solving: A filtering perspective

    Zehao Dou and Yang Song. Diffusion posterior sampling for linear inverse problem solving: A filtering perspective. In International Conference on Learning Representations, 2024

  10. [10]

    An introduction to sequential monte carlo methods

    Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential monte carlo methods. Sequential Monte Carlo methods in practice, pages 3--14, 2001

  11. [11]

    Reinforcement learning for fine-tuning text-to-image diffusion models

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models. In Advances in Neural Information Processing Systems, volume 36, 2024

  12. [12]

    Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. In Advances in Neural Information Processing Systems, volume 37, 2024

  13. [13]

    Openwebtext corpus

    Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

  14. [14]

    Scaling diffusion language models via adaptation from autoregressive models

    Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. In International Conference on Learning Representations, 2025

  15. [15]

    Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control

    Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11575--11596, 2023

  16. [16]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in neural information processing systems, volume 33, pages 6840--6851, 2020

  17. [17]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In Advances in Neural Information Processing Systems, volume 35, pages 8633--8646, 2022

  18. [18]

    Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control

    Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, Jos \'e Miguel Hern \'a ndez-Lobato, Richard E Turner, and Douglas Eck. Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control. In International Conference on Machine Learning, pages 1645--1654, 2017

  19. [19]

    CTRL: A Conditional Transformer Language Model for Controllable Generation

    Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019

  20. [20]

    Inference-time scaling for flow models via stochastic generation and rollover budget forcing.arXiv preprint arXiv:2503.19385, 2025

    Jaihoon Kim, Taehoon Yoon, Jisung Hwang, and Minhyuk Sung. Inference-time scaling for flow models via stochastic generation and rollover budget forcing. arXiv preprint arXiv:2503.19385, 2025

  21. [21]

    Rl with kl penalties is better viewed as bayesian inference

    Tomasz Korbak, Ethan Perez, and Christopher Buckley. Rl with kl penalties is better viewed as bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1083--1091, 2022

  22. [22]

    Sequential monte carlo steering of large language models using probabilistic programs,

    Alexander K Lew, Tan Zhi-Xuan, Gabriel Grand, and Vikash K Mansinghka. Sequential monte carlo steering of large language models using probabilistic programs. arXiv preprint arXiv:2306.03081, 2023

  23. [23]

    Discrete predictor-corrector diffusion models for image synthesis

    Jose Lezama, Tim Salimans, Lu Jiang, Huiwen Chang, Jonathan Ho, and Irfan Essa. Discrete predictor-corrector diffusion models for image synthesis. In International Conference on Learning Representations, 2022

  24. [24]

    arXiv preprint arXiv:2408.08252 , year =

    Xiner Li, Yulai Zhao, Chenyu Wang, Gabriele Scalia, Gokcen Eraslan, Surag Nair, Tommaso Biancalani, Shuiwang Ji, Aviv Regev, Sergey Levine, et al. Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding. arXiv preprint arXiv:2408.08252, 2024

  25. [25]

    Particle gibbs with ancestor sampling

    Fredrik Lindsten, Michael I Jordan, and Thomas B Sch \"o n. Particle gibbs with ancestor sampling. The Journal of Machine Learning Research, 15 0 (1): 0 2145--2184, 2014

  26. [26]

    Paradetox: Detoxification with parallel data

    Varvara Logacheva, Daryna Dementieva, Sergey Ustyantsev, Daniil Moskovskiy, David Dale, Irina Krotova, Nikita Semenov, and Alexander Panchenko. Paradetox: Detoxification with parallel data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6804--6818, 2022

  27. [27]

    Discrete diffusion modeling by estimating the ratios of the data distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning, 2023

  28. [28]

    Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

    Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732, 2025

  29. [29]

    Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp

    John Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119--126, 2020

  30. [30]

    Elements of sequential monte carlo

    Christian A Naesseth, Fredrik Lindsten, Thomas B Sch \"o n, et al. Elements of sequential monte carlo. Foundations and Trends in Machine Learning , 12 0 (3): 0 307--392, 2019

  31. [31]

    Particle-filtering-based latent diffusion for inverse problems

    Amir Nazemi, Mohammad Hadi Sepanj, Nicholas Pellegrino, Chris Czarnecki, and Paul Fieguth. Particle-filtering-based latent diffusion for inverse problems. arXiv preprint arXiv:2408.13868, 2024

  32. [32]

    Scaling up masked diffusion models on text

    Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. In International Conference on Learning Representations, 2025 a

  33. [33]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025 b

  34. [34]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in neural information processing systems, volume 35, pages 27730--27744, 2022

  35. [35]

    W\"urstchen: An efficient architecture for large-scale text-to-image diffusion models

    Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. W\"urstchen: An efficient architecture for large-scale text-to-image diffusion models. In International Conference on Learning Representations, 2024

  36. [36]

    SDXL : Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M \"u ller, Joe Penna, and Robin Rombach. SDXL : Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, 2024

  37. [37]

    A probabilistic inference approach to inference-time scaling of llms using particle-based monte carlo methods

    Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, and Akash Srivastava. A probabilistic inference approach to inference-time scaling of llms using particle-based monte carlo methods. arXiv preprint arXiv:2502.01618, 2025

  38. [38]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

  39. [39]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36, 2024

  40. [40]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  41. [41]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684--10695, 2022

  42. [42]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. O...

  43. [43]

    Simple and effective masked diffusion language models

    Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In Advances in Neural Information Processing Systems, volume 37, pages 130136--130184, 2024

  44. [44]

    Simple guidance mechanisms for discrete diffusion models

    Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. In International Conference on Learning Representations, 2025

  45. [45]

    Simplified and generalized masked diffusion for discrete data

    Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. In Advances in neural information processing systems, volume 37, pages 103131--103167, 2024

  46. [46]

    A general framework for inference-time scaling and steering of diffusion models.arXiv preprint arXiv:2501.06848, 2025

    Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models. arXiv preprint arXiv:2501.06848, 2025

  47. [47]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  48. [48]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256--2265, 2015

  49. [49]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019

  50. [50]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  51. [51]

    Fine- tuning of continuous-time diffusion models as entropy-regularized control.arXiv preprint arXiv:2402.15194, 2024

    Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani, and Sergey Levine. Fine-tuning of continuous-time diffusion models as entropy-regularized control. arXiv preprint arXiv:2402.15194, 2024 a

  52. [52]

    Bridging model-based optimization and generative modeling via conservative fine-tuning of diffusion models

    Masatoshi Uehara, Yulai Zhao, Ehsan Hajiramezanali, Gabriele Scalia, Gokcen Eraslan, Avantika Lal, Sergey Levine, and Tommaso Biancalani. Bridging model-based optimization and generative modeling via conservative fine-tuning of diffusion models. In Advances in Neural Information Processing Systems, volume 37, pages 127511--127535, 2024 b

  53. [53]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228--8238, 2024

  54. [54]

    S., and Kuleshov, V

    Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling. arXiv preprint arXiv:2503.00307, 2025

  55. [55]

    Neural network acceptability judgments

    Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. In Transactions of the Association for Computational Linguistics, volume 7, pages 625--641. MIT Press One Rogers Street, Cambridge, MA 02142-1209, USA journals-info …, 2019

  56. [56]

    Trippe, Christian A Naesseth, John Patrick Cunningham, and David Blei

    Luhuan Wu, Brian L. Trippe, Christian A Naesseth, John Patrick Cunningham, and David Blei. Practical and asymptotically exact conditional sampling in diffusion models. In Advances in Neural Information Processing Systems, 2023

  57. [57]

    Geodiff: A geometric diffusion model for molecular conformation generation

    Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations, 2022

  58. [58]

    Informed correctors for discrete diffusion models, 2025

    Yixiu Zhao, Jiaxin Shi, Feng Chen, Shaul Druckmann, Lester Mackey, and Scott Linderman. Informed correctors for discrete diffusion models. arXiv preprint arXiv:2407.21243, 2024

  59. [59]

    Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

    Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. In International Conference on Learning Representations, 2025