Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement

Akash Srivastava; Jiaqi Han; Kai Xu; Meihua Dang; Minkai Xu; Stefano Ermon

arXiv: 2507.08390 · v4 · submitted 2025-07-11 · 💻 cs.LG

Meihua Dang, Jiaqi Han, Minkai Xu, Kai Xu, Akash Srivastava, Stefano Ermon

Pith reviewed 2026-05-19 04:58 UTC · model grok-4.3

classification 💻 cs.LG

keywords diffusion language models · inference-time scaling · particle Gibbs sampling · trajectory refinement · sequential Monte Carlo · reward-guided generation · GSM8K

The pith

Particle Gibbs sampling refines full denoising trajectories in diffusion language models to introduce a new scaling axis of refinement iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Discrete diffusion language models match autoregressive performance after large-scale training, but inference-time methods for steering their outputs toward rewards without retraining remain underexplored. Earlier techniques resample or filter step by step inside a single denoising path. PG-DLM instead builds a Markov chain across complete trajectories and uses a conditional sequential Monte Carlo kernel to resample them. This creates an extra compute dimension, the number of refinement iterations, that continues to raise accuracy after adding more parallel particles stops helping. The method also supports adaptive compute by running extra iterations only when needed, and the paper derives convergence guarantees and variance bounds.

Core claim

PG-DLM constructs a Markov chain over full denoising trajectories and applies a conditional sequential Monte Carlo kernel to resample them, introducing a new scaling axis (number of refinement iterations) that remains effective even as gains from adding more parallel samples saturate.

What carries the argument

Markov chain over full denoising trajectories with a conditional sequential Monte Carlo kernel for resampling.
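To make the mechanism concrete, here is a minimal toy sketch of particle Gibbs with a conditional SMC sweep. It is not the authors' implementation: the left-to-right unmasking schedule, the uniform `denoise_step`, the count-the-ones `reward`, the temperature `beta`, and all function names are assumptions chosen only so the example runs with the standard library. The idea it illustrates is that one particle is pinned to the previous reference trajectory, the other particles are proposed and resampled around it, and a new reference is drawn at the end of each sweep.

```python
import math
import random

MASK = -1  # placeholder for a still-masked position

def reward(state):
    # Toy reward (assumption): number of unmasked tokens equal to 1.
    return sum(1 for tok in state if tok == 1)

def denoise_step(state, pos, rng):
    # Toy denoiser (assumption): unmask position `pos` uniformly from {0, 1}.
    nxt = list(state)
    nxt[pos] = rng.choice((0, 1))
    return nxt

def conditional_smc(seq_len, k, beta, rng, reference=None):
    """One SMC sweep over full trajectories. When `reference` is given,
    particle 0 replays it and always survives resampling."""
    particles = [[[MASK] * seq_len] for _ in range(k)]
    weights = [1.0] * k
    for t in range(seq_len):
        extended = []
        for i, traj in enumerate(particles):
            if reference is not None and i == 0:
                nxt = reference[t + 1]  # pinned particle: replay the reference
            else:
                nxt = denoise_step(traj[-1], t, rng)
            extended.append(traj + [nxt])
        # Incremental potential exp((r(new) - r(old)) / beta).
        weights = [math.exp((reward(tr[-1]) - reward(tr[-2])) / beta)
                   for tr in extended]
        if t < seq_len - 1:
            idx = rng.choices(range(k), weights=weights, k=k)
            if reference is not None:
                idx[0] = 0  # keep the reference's ancestry intact
            particles = [extended[j] for j in idx]
        else:
            particles = extended
    # Draw the next reference trajectory from the final weighted particles.
    return rng.choices(particles, weights=weights, k=1)[0]

def particle_gibbs(seq_len, k, m, beta, seed=0):
    """Run m refinement iterations with k particles each."""
    rng = random.Random(seed)
    reference = None  # the first sweep is plain (unconditional) SMC
    for _ in range(m):
        reference = conditional_smc(seq_len, k, beta, rng, reference)
    return reference

trajectory = particle_gibbs(seq_len=6, k=4, m=3, beta=0.5)
print(trajectory[-1], reward(trajectory[-1]))
```

In this sketch `k` plays the role of the parallel sample count and `m` the number of refinement iterations, so the two scaling axes discussed above appear as two separate arguments.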

If this is right

Reward-guided generation accuracy rises with additional refinement iterations after parallel sampling saturates.
Adaptive compute allocation becomes possible by performing extra iterations only on difficult samples.
Theoretical convergence and variance bounds hold for the constructed Markov chain.
Empirical outperformance occurs across varying compute budgets on tasks such as GSM8K math problems.
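The adaptive-compute point can be illustrated with a small sketch. The source says only that extra iterations are performed when needed; the reward-threshold stopping rule, the iteration cap, and the toy `toy_reward` and `toy_refine` helpers below are assumptions made for illustration, not the paper's criterion.

```python
def adaptive_refine(sample, refine_step, reward_fn, threshold, max_iters):
    """Apply refine_step until reward_fn(sample) >= threshold or the cap is hit.
    Returns the final sample and the number of iterations actually used."""
    used = 0
    while used < max_iters and reward_fn(sample) < threshold:
        sample = refine_step(sample)
        used += 1
    return sample, used

# Toy stand-ins (assumptions): reward counts 1-bits, one refinement fixes one 0-bit.
def toy_reward(bits):
    return sum(bits)

def toy_refine(bits):
    out = list(bits)
    if 0 in out:
        out[out.index(0)] = 1
    return out

prompts = [[1, 1, 1, 1], [1, 0, 1, 1], [0, 0, 0, 1]]
results = [adaptive_refine(p, toy_refine, toy_reward, threshold=4, max_iters=8)
           for p in prompts]
iters_used = [used for _, used in results]
print(iters_used, sum(iters_used) / len(iters_used))
```

Easy inputs stop immediately while hard ones consume more iterations, so the average cost sits below the worst-case cap. This is the same kind of accounting that lets an average particle count be fractional.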

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The trajectory-level view may extend to other discrete generative processes where step-wise resampling is currently used.
Future inference scaling laws could treat refinement iterations as an orthogonal axis to model size or sample count.
Combining trajectory refinement with light fine-tuning might produce larger gains than either alone.

Load-bearing premise

The conditional sequential Monte Carlo kernel can be implemented efficiently and the resulting Markov chain mixes sufficiently fast to deliver measurable gains within a small number of refinement iterations.

What would settle it

If accuracy on GSM8K stops improving when refinement iterations increase from a few to ten while particle count stays fixed, the claim of an effective new scaling axis would be contradicted.

Figures

Figures reproduced from arXiv: 2507.08390 by Akash Srivastava, Jiaqi Han, Kai Xu, Meihua Dang, Minkai Xu, Stefano Ermon.

**Figure 2.** Trade-off between particle Gibbs iterations m and sample counts k across compute budgets (NFEs). The x-axis shows NFEs controlled by varying k, and the legend shows m. Increasing k (with m = 1) performs best in low-NFE regimes. However, as samples saturate, additional iterations (m = 2, 4) become more effective.

| m | k | Toxicity |
|---|---|---|
| 1 | 32 | 90.3 |
| 2 | 16 | 93.6 |
| 4 | 8 | 91.7 |
| 1 | 64 | 96.3 |
| 2 | 32 | 97.0 |
| 4 | 16 | 97.6 |

**Figure 4.** Trade-offs between sample counts k and denoising steps T across compute budgets (NFEs). For (a) LLaDA, the x-axis shows NFEs controlled by varying k, with T in the legend; for (b-d) MDLM, the x-axis shows NFEs controlled by varying T, with k in the legend. Scaling k (and decreasing T accordingly) generally yields better performance under the same NFEs.

**Figure 6.** Comparison of ReMDM and vanilla MDLM backward processes under varying compute budgets (NFEs). The x-axis shows NFEs, controlled by varying the number of samples k, while the legend shows denoising steps T ∈ {128, 256, 512}. ReMDM consistently achieves higher accuracies, demonstrating the effectiveness of improved backward transition dynamics.

**Figure 7.** Trade-offs between sample counts k and denoising steps T across compute budgets (NFEs) for SMC (ϕ = 1). The x-axis shows NFEs controlled by varying T, with k in the legend. Scaling k (and decreasing T accordingly) generally yields better performance under the same NFEs.

**Figure 8.** Trade-offs between sample counts k and denoising steps T across compute budgets (NFEs) for SMC (ϕ = 4). The x-axis shows NFEs controlled by varying T, with k in the legend. Scaling k (and decreasing T accordingly) generally yields better performance under the same NFEs.

**Figure 9.** Trade-offs between sample counts k and denoising steps T across compute budgets (NFEs) for BON. The x-axis shows NFEs controlled by varying T, with k in the legend. Scaling k (and decreasing T accordingly) generally yields better performance under the same NFEs.

**Figure 10.** Comparison of Beam and Random sampling for partial reward estimation with varying …
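The m, k, and Toxicity values listed with Figure 2 can be read as two matched-budget groups. The short sketch below groups the rows by the product m · k and picks the best configuration in each group. Treating m · k as a proxy for compute at a fixed number of denoising steps, and treating a higher Toxicity value as better, are assumptions made here; the source states the budget only in terms of NFEs.

```python
rows = [  # (m, k, Toxicity) as listed with Figure 2
    (1, 32, 90.3), (2, 16, 93.6), (4, 8, 91.7),
    (1, 64, 96.3), (2, 32, 97.0), (4, 16, 97.6),
]

best = {}
for m, k, score in rows:
    budget = m * k  # assumption: m * k as a compute proxy at fixed T
    if budget not in best or score > best[budget][2]:
        best[budget] = (m, k, score)

for budget in sorted(best):
    print(budget, best[budget])
# prints:
# 32 (2, 16, 93.6)
# 64 (4, 16, 97.6)
```

Under these assumptions, neither budget is best served by m = 1, which matches the caption's statement that additional iterations become more effective once samples saturate.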

Original abstract

Discrete diffusion models have recently emerged as strong alternatives to autoregressive language models, matching their performance through large-scale training. However, inference-time control remains relatively underexplored. In this work, we study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), an inference-time algorithm enabling trajectory-level refinement. PG-DLM constructs a Markov chain over full denoising trajectories and applies a conditional sequential Monte Carlo kernel to resample them. By doing so, PG-DLM introduces a new scaling axis, the number of refinement iterations, which is unavailable to prior methods. Increasing iterations remains effective even as gains from adding more parallel samples saturate. Furthermore, PG-DLM enables adaptive compute allocation by performing additional iterations only when needed, leading to further efficiency gains. We derive theoretical guarantees for convergence and variance bounds, and analyze trade-offs across different scaling axes. Empirically, PG-DLM outperforms prior methods across compute budgets on reward-guided generation tasks. On GSM8K, it achieves 90.07% accuracy with 2.9 particles on average and 94.47% accuracy with 16 particles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PG-DLM adds a refinement-iteration scaling axis to diffusion LM inference by running conditional SMC over full trajectories, with solid GSM8K numbers, but the mixing speed claim rests on unshown diagnostics in discrete space.

The core idea is straightforward: instead of resampling inside one denoising path, they run particle Gibbs over complete trajectories and use a conditional SMC kernel to refine them. This creates an extra compute knob (number of refinement iterations) that they say keeps helping after adding more parallel particles stops paying off. On GSM8K they report 90% accuracy with roughly 2.9 particles on average and 94% with 16, plus some adaptive allocation that only spends extra iterations when needed. That part looks useful and the adaptation of standard SMC machinery to diffusion LMs is a clean move that prior step-wise methods did not have.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PG-DLM, a particle Gibbs sampling method for diffusion language models. It constructs a Markov chain over full denoising trajectories and applies a conditional sequential Monte Carlo kernel to enable trajectory-level refinement. This adds the number of refinement iterations as a new inference-time scaling axis that remains effective after parallel sampling saturates. Theoretical convergence and variance bounds are derived from SMC/MCMC theory, and experiments show outperformance on reward-guided tasks, including 90.07% accuracy on GSM8K with 2.9 particles on average and 94.47% with 16 particles.

Significance. If the conditional SMC kernel mixes sufficiently fast and can be implemented efficiently, the work provides a new, adaptive compute axis for steering discrete diffusion LMs at inference time without retraining. The theoretical guarantees and concrete GSM8K results strengthen the case for trajectory-level methods over step-wise resampling, with potential efficiency gains from adaptive iteration allocation.

major comments (2)

[Abstract and theoretical section] The convergence and variance bounds rely on standard SMC theory applied to the diffusion process, yet no mixing-time analysis, autocorrelation plots, or effective sample size curves versus refinement iteration count are provided to substantiate that the conditional SMC kernel mixes rapidly enough in the discrete high-dimensional trajectory space for the claimed gains with small iteration counts.
[Empirical evaluation on GSM8K] The reported accuracies (90.07% with average 2.9 particles, 94.47% with 16 particles) are presented without full specification of the precise SMC kernel form or any post-hoc tuning of particle counts and iteration schedules, leaving open the possibility that performance depends on implementation choices not visible in the manuscript.

minor comments (2)

[Methods section] Provide pseudocode or a clear algorithmic description of the conditional SMC kernel and the overall PG-DLM procedure to improve reproducibility.
[Figures comparing scaling axes] Include error bars or multiple random seeds to allow assessment of variability in the trade-off between refinement iterations and parallel samples.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We address each major comment below and indicate the revisions we plan to incorporate.

Referee: [Abstract and theoretical section] The convergence and variance bounds rely on standard SMC theory applied to the diffusion process, yet no mixing-time analysis, autocorrelation plots, or effective sample size curves versus refinement iteration count are provided to substantiate that the conditional SMC kernel mixes rapidly enough in the discrete high-dimensional trajectory space for the claimed gains with small iteration counts.

Authors: We acknowledge that the manuscript presents convergence and variance bounds derived from standard SMC/MCMC theory without accompanying empirical diagnostics such as mixing-time analysis or effective sample size curves. While the theoretical results guarantee asymptotic correctness independent of mixing rate, we agree that empirical evidence would better substantiate the practical gains observed with small iteration counts. In the revised manuscript we will add effective sample size curves and autocorrelation plots versus refinement iteration count for the GSM8K experiments to illustrate the mixing behavior of the conditional SMC kernel. revision: yes

Referee: [Empirical evaluation on GSM8K] The reported accuracies (90.07% with average 2.9 particles, 94.47% with 16 particles) are presented without full specification of the precise SMC kernel form or any post-hoc tuning of particle counts and iteration schedules, leaving open the possibility that performance depends on implementation choices not visible in the manuscript.

Authors: We thank the referee for highlighting the need for greater implementation transparency. The manuscript currently describes the PG-DLM procedure at the algorithmic level. To improve reproducibility, the revised version will provide the exact mathematical form of the conditional SMC kernel, pseudocode for the resampling step, and the precise particle counts together with the iteration schedules employed in the GSM8K experiments. We confirm that the reported accuracies correspond to these configurations without additional undisclosed tuning. revision: yes
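The diagnostics named in the first exchange, effective sample size and autocorrelation, are standard quantities that can be sketched briefly. The version below uses the Kish formula for ESS and a plain sample autocorrelation of a scalar chain. The example reward chain is hypothetical, and nothing here reflects how the authors would actually compute or report these curves.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def autocorrelation(series, lag):
    """Sample autocorrelation of a scalar chain at the given lag.
    Assumes the series is not constant (non-zero variance)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

# Uniform weights give ESS equal to the particle count; one dominant
# weight collapses ESS toward 1.
print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))
print(effective_sample_size([1.0, 0.0, 0.0, 0.0]))

# Hypothetical reward of the reference trajectory after each refinement iteration.
rewards_per_iteration = [3, 3, 4, 4, 5, 5, 5, 6]
print(autocorrelation(rewards_per_iteration, lag=1))
```

A lag-1 autocorrelation close to 1 across refinement iterations would suggest the chain moves slowly, which is the concern the referee raises about mixing in the discrete trajectory space.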

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard SMC theory independently

full rationale

The paper introduces PG-DLM by constructing a Markov chain over full denoising trajectories and applying a conditional sequential Monte Carlo kernel, deriving convergence and variance bounds from established SMC/MCMC theory rather than internal fits or self-citations. No equations reduce reported performance or scaling claims to quantities defined or fitted inside the paper itself. The new scaling axis (refinement iterations) is presented as a direct consequence of the trajectory-level resampling, with empirical results on tasks like GSM8K treated as external validation. The derivation remains self-contained against external benchmarks such as standard SMC convergence results.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method rests on standard diffusion Markov chain assumptions and SMC convergence theory; the main free parameters are the number of particles and refinement iterations, which are varied experimentally rather than derived.

free parameters (2)

number of particles
Controls parallel diversity; the abstract reports 90.07% GSM8K accuracy with 2.9 particles on average and 94.47% with 16. Chosen to balance compute and performance.
number of refinement iterations
New scaling axis; increased until gains saturate or adaptive stopping is triggered.

axioms (1)

Domain assumption: The discrete diffusion process induces a valid Markov chain on full trajectories that admits a conditional SMC kernel.
Invoked when constructing the Markov chain over denoising trajectories.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.

Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
cs.CL 2026-05 unverdicted novelty 7.0

FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.

Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution
cs.CV 2025-12 unverdicted novelty 7.0

IAFS is a training-free iterative inference-time scaling framework that uses adaptive frequency-aware particle fusion to resolve the perception-fidelity conflict in diffusion super-resolution models, outperforming pri...

ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
cs.LG 2026-01 conditional novelty 6.0

ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.

ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
cs.LG 2026-01 unverdicted novelty 6.0

ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 4 Pith papers · 7 internal anchors

[1]

Particle markov chain monte carlo methods

Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle markov chain monte carlo methods. Journal of the Royal Statistical Society Series B: Statistical Methodology, 72 0 (3): 0 269--342, 2010

work page 2010
[2]

Structured denoising diffusion models in discrete state-spaces

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. In Advances in neural information processing systems, volume 34, pages 17981--17993, 2021

work page 2021
[3]

Tweeteval: Unified benchmark and comparative evaluation for tweet classification

Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. Tweeteval: Unified benchmark and comparative evaluation for tweet classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1644--1650, 2020

work page 2020
[4]

Scaling test-time compute with open models

Edward Beeching, Lewis Tunstall, and Sasha Rush. Scaling test-time compute with open models. URL: https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute, 2024

work page 2024
[5]

Training diffusion models with reinforcement learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. In International Conference on Learning Representations, 2024

work page 2024
[6]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Directly Fine-Tuning Diffusion Models on Differentiable Rewards

Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Plug and play language models: A simple approach to controlled text generation

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations, 2020

work page 2020
[9]

Diffusion posterior sampling for linear inverse problem solving: A filtering perspective

Zehao Dou and Yang Song. Diffusion posterior sampling for linear inverse problem solving: A filtering perspective. In International Conference on Learning Representations, 2024

work page 2024
[10]

An introduction to sequential monte carlo methods

Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential monte carlo methods. Sequential Monte Carlo methods in practice, pages 3--14, 2001

work page 2001
[11]

Reinforcement learning for fine-tuning text-to-image diffusion models

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models. In Advances in Neural Information Processing Systems, volume 36, 2024

work page 2024
[12]

Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. In Advances in Neural Information Processing Systems, volume 37, 2024

work page 2024
[13]

Openwebtext corpus

Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

work page 2019
[14]

Scaling diffusion language models via adaptation from autoregressive models

Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. In International Conference on Learning Representations, 2025

work page 2025
[15]

Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control

Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11575--11596, 2023

work page 2023
[16]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in neural information processing systems, volume 33, pages 6840--6851, 2020

work page 2020
[17]

Video diffusion models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In Advances in Neural Information Processing Systems, volume 35, pages 8633--8646, 2022

work page 2022
[18]

Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, Jos \'e Miguel Hern \'a ndez-Lobato, Richard E Turner, and Douglas Eck. Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control. In International Conference on Machine Learning, pages 1645--1654, 2017

work page 2017
[19]

CTRL: A Conditional Transformer Language Model for Controllable Generation

Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[20]

Inference-time scaling for flow models via stochastic generation and rollover budget forcing.arXiv preprint arXiv:2503.19385, 2025

Jaihoon Kim, Taehoon Yoon, Jisung Hwang, and Minhyuk Sung. Inference-time scaling for flow models via stochastic generation and rollover budget forcing. arXiv preprint arXiv:2503.19385, 2025

work page arXiv 2025
[21]

Rl with kl penalties is better viewed as bayesian inference

Tomasz Korbak, Ethan Perez, and Christopher Buckley. Rl with kl penalties is better viewed as bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1083--1091, 2022

work page 2022
[22]

Sequential monte carlo steering of large language models using probabilistic programs,

Alexander K Lew, Tan Zhi-Xuan, Gabriel Grand, and Vikash K Mansinghka. Sequential monte carlo steering of large language models using probabilistic programs. arXiv preprint arXiv:2306.03081, 2023

work page arXiv 2023
[23]

Discrete predictor-corrector diffusion models for image synthesis

Jose Lezama, Tim Salimans, Lu Jiang, Huiwen Chang, Jonathan Ho, and Irfan Essa. Discrete predictor-corrector diffusion models for image synthesis. In International Conference on Learning Representations, 2022

work page 2022
[24]

arXiv preprint arXiv:2408.08252 , year =

Xiner Li, Yulai Zhao, Chenyu Wang, Gabriele Scalia, Gokcen Eraslan, Surag Nair, Tommaso Biancalani, Shuiwang Ji, Aviv Regev, Sergey Levine, et al. Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding. arXiv preprint arXiv:2408.08252, 2024

work page arXiv 2024
[25]

Particle gibbs with ancestor sampling

Fredrik Lindsten, Michael I Jordan, and Thomas B Sch \"o n. Particle gibbs with ancestor sampling. The Journal of Machine Learning Research, 15 0 (1): 0 2145--2184, 2014

work page 2014
[26]

Paradetox: Detoxification with parallel data

Varvara Logacheva, Daryna Dementieva, Sergey Ustyantsev, Daniil Moskovskiy, David Dale, Irina Krotova, Nikita Semenov, and Alexander Panchenko. Paradetox: Detoxification with parallel data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6804--6818, 2022

work page 2022
[27]

Discrete diffusion modeling by estimating the ratios of the data distribution

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning, 2023

work page 2023
[28]

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732, 2025

work page internal anchor Pith review arXiv 2025
[29]

Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp

John Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119--126, 2020

work page 2020
[30]

Elements of sequential monte carlo

Christian A Naesseth, Fredrik Lindsten, Thomas B Sch \"o n, et al. Elements of sequential monte carlo. Foundations and Trends in Machine Learning , 12 0 (3): 0 307--392, 2019

work page 2019
[31]

Particle-filtering-based latent diffusion for inverse problems

Amir Nazemi, Mohammad Hadi Sepanj, Nicholas Pellegrino, Chris Czarnecki, and Paul Fieguth. Particle-filtering-based latent diffusion for inverse problems. arXiv preprint arXiv:2408.13868, 2024

work page arXiv 2024
[32]

Scaling up masked diffusion models on text

Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. In International Conference on Learning Representations, 2025 a

work page 2025
[33]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in neural information processing systems, volume 35, pages 27730--27744, 2022

work page 2022
[35]

W\"urstchen: An efficient architecture for large-scale text-to-image diffusion models

Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. W\"urstchen: An efficient architecture for large-scale text-to-image diffusion models. In International Conference on Learning Representations, 2024

work page 2024
[36]

SDXL : Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M \"u ller, Joe Penna, and Robin Rombach. SDXL : Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, 2024

work page 2024
[37]

A probabilistic inference approach to inference-time scaling of llms using particle-based monte carlo methods

Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, and Akash Srivastava. A probabilistic inference approach to inference-time scaling of llms using particle-based monte carlo methods. arXiv preprint arXiv:2502.01618, 2025

work page arXiv 2025
[38]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

work page 2019
[39]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36, 2024

work page 2024
[40]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[41]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684--10695, 2022

work page 2022
[42]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. O...

work page 2022
[43]

Simple and effective masked diffusion language models

Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. In Advances in Neural Information Processing Systems, volume 37, pages 130136--130184, 2024

work page 2024
[44]

Simple guidance mechanisms for discrete diffusion models

Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P de Almeida, Alexander Rush, Thomas Pierrot, and Volodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. In International Conference on Learning Representations, 2025

work page 2025
[45]

Simplified and generalized masked diffusion for discrete data

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. In Advances in neural information processing systems, volume 37, pages 103131--103167, 2024

work page 2024
[46]

A general framework for inference-time scaling and steering of diffusion models

Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models. arXiv preprint arXiv:2501.06848, 2025

[47]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

[48]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256--2265, 2015

[49]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019

[50]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

[51]

Fine-tuning of continuous-time diffusion models as entropy-regularized control

Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani, and Sergey Levine. Fine-tuning of continuous-time diffusion models as entropy-regularized control. arXiv preprint arXiv:2402.15194, 2024a

[52]

Bridging model-based optimization and generative modeling via conservative fine-tuning of diffusion models

Masatoshi Uehara, Yulai Zhao, Ehsan Hajiramezanali, Gabriele Scalia, Gokcen Eraslan, Avantika Lal, Sergey Levine, and Tommaso Biancalani. Bridging model-based optimization and generative modeling via conservative fine-tuning of diffusion models. In Advances in Neural Information Processing Systems, volume 37, pages 127511--127535, 2024b

[53]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228--8238, 2024

[54]

Remasking discrete diffusion models with inference-time scaling

Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling. arXiv preprint arXiv:2503.00307, 2025

[55]

Neural network acceptability judgments

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, volume 7, pages 625--641, 2019

[56]

Practical and asymptotically exact conditional sampling in diffusion models

Luhuan Wu, Brian L. Trippe, Christian A Naesseth, John Patrick Cunningham, and David Blei. Practical and asymptotically exact conditional sampling in diffusion models. In Advances in Neural Information Processing Systems, 2023

[57]

Geodiff: A geometric diffusion model for molecular conformation generation

Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations, 2022

[58]

Informed correctors for discrete diffusion models

Yixiu Zhao, Jiaxin Shi, Feng Chen, Shaul Druckmann, Lester Mackey, and Scott Linderman. Informed correctors for discrete diffusion models. arXiv preprint arXiv:2407.21243, 2024

[59]

Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling

Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. In International Conference on Learning Representations, 2025

