The Diffusion Duality, Chapter II: Psi-Samplers
Pith reviewed 2026-05-21 11:43 UTC · model grok-4.3
The pith
Predictor-corrector samplers enable discrete diffusion models to improve generation quality with increasing sampling steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a family of Predictor-Corrector samplers for discrete diffusion generalizes prior methods and applies to arbitrary noise processes. Paired with uniform-state diffusion, these samplers yield lower generative perplexity at matched unigram entropy on OpenWebText and better FID and IS scores on CIFAR10 than ancestral sampling. The decisive property is that performance continues to improve, rather than plateau, as the number of sampling steps increases.
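The "matched unigram entropy" condition matters because generative perplexity can be lowered by repetitive, low-diversity samples. A minimal sketch of one common way to compute the statistic follows, assuming entropy is taken over each sample's empirical token distribution and then averaged across samples. The paper's exact procedure is not stated in this summary, and the names `unigram_entropy`, `mean_unigram_entropy`, `entropy_matched`, and the tolerance `tol` are illustrative.

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy (in nats) of the empirical token distribution of one sample."""
    if not tokens:
        raise ValueError("need at least one token")
    n = len(tokens)
    counts = Counter(tokens)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def mean_unigram_entropy(samples):
    """Average per-sample unigram entropy over a batch of token sequences."""
    return sum(unigram_entropy(s) for s in samples) / len(samples)

def entropy_matched(samples_a, samples_b, tol=0.05):
    """True when two sample sets have mean unigram entropy within `tol` nats."""
    gap = abs(mean_unigram_entropy(samples_a) - mean_unigram_entropy(samples_b))
    return gap <= tol

if __name__ == "__main__":
    repetitive = [[5, 5, 5, 5, 5, 5]]   # collapses to one token
    diverse = [[1, 2, 3, 4, 5, 6]]      # every token distinct
    print(mean_unigram_entropy(repetitive), mean_unigram_entropy(diverse))
    print(entropy_matched(repetitive, diverse))
```

Under this definition, a perplexity comparison between two samplers would only be read as meaningful when `entropy_matched` holds for their outputs.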
What carries the argument
Predictor-Corrector (PC) samplers, a family of sampling algorithms that combine a predictor step with a corrector step to sample from discrete diffusion models and that generalize earlier techniques to arbitrary noise processes.
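As a rough illustration of the predictor/corrector split, the toy sketch below samples a single token under a uniform-state noise process. It is not the paper's Ψ-sampler: the exact posterior over a hand-picked `P_DATA` stands in for a trained denoiser, the linear schedule `alpha(t) = 1 - t` is an assumption, and the corrector is a simple Gibbs-style resampling step chosen because it leaves the noisy marginal at each level unchanged.

```python
import random

K = 4                              # toy vocabulary size (assumption)
P_DATA = [0.6, 0.25, 0.1, 0.05]    # toy data distribution (assumption)

def alpha(t):
    """Signal level: 1 at t=0 (clean data), 0 at t=1 (pure uniform noise)."""
    return 1.0 - t

def forward_prob(x_t, x_0, a):
    """Uniform-state kernel q(x_t | x_0) = a*[x_t == x_0] + (1 - a)/K."""
    return a * (x_t == x_0) + (1.0 - a) / K

def denoise_posterior(x_t, a):
    """Exact p(x_0 | x_t); a stand-in for a learned denoising network."""
    w = [P_DATA[x0] * forward_prob(x_t, x0, a) for x0 in range(K)]
    z = sum(w)
    return [v / z for v in w]

def sample(weights, rng):
    """Draw one index in proportion to (possibly unnormalized) weights."""
    return rng.choices(range(K), weights=weights, k=1)[0]

def predictor(x_t, t, s, rng):
    """Ancestral step t -> s: draw x_0, then x_s ~ q(x_s | x_t, x_0)."""
    a_t, a_s = alpha(t), alpha(s)
    x0 = sample(denoise_posterior(x_t, a_t), rng)
    a_ts = a_t / a_s               # signal retained between levels s and t
    w = [forward_prob(x_t, xs, a_ts) * forward_prob(xs, x0, a_s)
         for xs in range(K)]
    return sample(w, rng)

def corrector(x_s, s, rng):
    """Gibbs-style step at fixed level s: denoise to x_0, then re-noise."""
    a_s = alpha(s)
    x0 = sample(denoise_posterior(x_s, a_s), rng)
    return sample([forward_prob(xs, x0, a_s) for xs in range(K)], rng)

def pc_sample(num_steps, corrector_steps, rng):
    """Run predictor steps from t=1 to t=0, with correctors in between."""
    x = rng.randrange(K)           # x_1 drawn from the uniform prior
    ts = [1.0 - i / num_steps for i in range(num_steps + 1)]
    for t, s in zip(ts[:-1], ts[1:]):
        x = predictor(x, t, s, rng)
        if s > 0:                  # nothing to correct on clean data
            for _ in range(corrector_steps):
                x = corrector(x, s, rng)
    return x

if __name__ == "__main__":
    rng = random.Random(0)
    print([pc_sample(num_steps=8, corrector_steps=2, rng=rng) for _ in range(10)])
```

Because the toy denoiser is exact, the predictor alone already samples `P_DATA` correctly, so the sketch shows only the control flow. Corrector steps would be expected to matter when the denoiser is an imperfect learned model.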
Load-bearing premise
The performance gains reported for the new samplers are caused by the Predictor-Corrector methods themselves rather than unstated differences in model training, data handling, or evaluation protocols.
What would settle it
Training identical models under the same protocol and then finding that ancestral sampling matches or exceeds the Predictor-Corrector samplers on generative perplexity for OpenWebText or on FID scores for CIFAR10 would falsify the superiority claim.
Original abstract
Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or Masked diffusion models in these settings. However, their sampling quality plateaus with ancestral samplers as the number of steps increases. We introduce a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers outperform ancestral sampling on both language and image modeling, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR10. Crucially, unlike conventional samplers, our PC methods continue to improve with more sampling steps. Taken together, these findings call into question the assumption that Masked diffusion is the inevitable future of diffusion-based language modeling. Beyond sampling, we develop a memory-efficient curriculum for the Gaussian relaxation training phase, reducing training time by 25% and memory by 33% compared to Duo while maintaining comparable perplexity on OpenWebText and LM1B and strong downstream performance. We release code, checkpoints, and a video-tutorial on: https://s-sahoo.com/duo-ch2
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Ψ-Samplers, a family of Predictor-Corrector (PC) samplers for discrete diffusion models that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, the samplers outperform ancestral sampling on OpenWebText (lower generative perplexity at matched unigram entropy) and CIFAR10 (better FID/IS scores). A central observation is that PC sampling quality continues to improve with additional steps, unlike conventional samplers. The work also presents a memory-efficient curriculum for the Gaussian relaxation training phase that reduces training time by 25% and memory by 33% relative to prior baselines while preserving perplexity and downstream performance.
Significance. If the empirical results hold under controlled conditions, the work is significant for discrete diffusion modeling. It supplies evidence that uniform-state diffusion with improved samplers can challenge the assumed primacy of masked diffusion for language tasks, and the continued scaling with step count directly addresses a documented limitation of ancestral sampling. The training curriculum provides a practical efficiency gain with quantified resource savings.
major comments (2)
- [Experimental Evaluation] Experimental section: the central claim that PC samplers cause the reported gains on OpenWebText and CIFAR10 requires explicit confirmation that identical trained models, noise schedules, entropy-matching procedures, and evaluation code were used for both ancestral and PC runs. The abstract states the samplers are 'paired with uniform-state diffusion' but does not detail whether any ancillary differences in temperature, step count, or checkpoint selection were controlled; this is load-bearing for attributing improvements to the sampler construction itself.
- [Method and Theory] Generalization claim: the abstract asserts that the PC methods 'apply to arbitrary noise processes,' yet all quantitative results are confined to the uniform-state case. If the broader applicability is part of the contribution, either a theoretical argument showing invariance to the noise process or at least one additional empirical demonstration on a non-uniform process is needed to support the claim.
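The control requested in the first major comment can be checked mechanically. A minimal sketch, assuming each run is described by a flat configuration dictionary; the keys (`checkpoint`, `noise_schedule`, `num_steps`, `temperature`, `eval_code_rev`, `sampler`) and their values are hypothetical and not taken from the released code.

```python
def differing_keys(run_a, run_b):
    """Return the sorted config keys whose values differ between two runs."""
    missing = object()  # sentinel so a key absent from one run counts as a difference
    keys = set(run_a) | set(run_b)
    return sorted(k for k in keys if run_a.get(k, missing) != run_b.get(k, missing))

def is_controlled_comparison(run_a, run_b, allowed=("sampler",)):
    """True when the runs differ in at least one allowed key and in nothing else."""
    diff = differing_keys(run_a, run_b)
    return bool(diff) and set(diff) <= set(allowed)

# Hypothetical run descriptions; every value here is a placeholder.
ancestral_run = {
    "checkpoint": "uniform-state-owt.ckpt",
    "noise_schedule": "log-linear",
    "num_steps": 1024,
    "temperature": 1.0,
    "eval_code_rev": "abc123",
    "sampler": "ancestral",
}
pc_run = dict(ancestral_run, sampler="predictor_corrector")

print(differing_keys(ancestral_run, pc_run))            # ['sampler']
print(is_controlled_comparison(ancestral_run, pc_run))  # True
```

A check of this kind only confirms what the configuration records; it cannot detect differences that are never written down, such as a changed evaluation script under the same revision label.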
minor comments (2)
- [Abstract] The link to code, checkpoints, and the video tutorial is provided; verify that all resources remain accessible and that the released implementation exactly reproduces the reported numbers.
- [Introduction] Ensure consistent use of 'Ψ-Samplers' versus 'Predictor-Corrector samplers' in the introduction and method sections to prevent notation confusion.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our experimental controls and generalization claims.
- Referee: [Experimental Evaluation] Experimental section: the central claim that PC samplers cause the reported gains on OpenWebText and CIFAR10 requires explicit confirmation that identical trained models, noise schedules, entropy-matching procedures, and evaluation code were used for both ancestral and PC runs. The abstract states the samplers are 'paired with uniform-state diffusion' but does not detail whether any ancillary differences in temperature, step count, or checkpoint selection were controlled; this is load-bearing for attributing improvements to the sampler construction itself.
  Authors: We appreciate the referee's emphasis on experimental rigor. All reported comparisons used identical trained models, the same noise schedules, entropy-matching procedures, and evaluation code for ancestral and PC runs. The same checkpoints were evaluated at the same step counts with no differences in temperature or other hyperparameters; the sole variable was the sampling procedure. We have added an explicit paragraph in the revised Section 4.1 (Experimental Setup) documenting these controls to make the attribution to the PC sampler construction unambiguous.
  Revision: yes
- Referee: [Method and Theory] Generalization claim: the abstract asserts that the PC methods 'apply to arbitrary noise processes,' yet all quantitative results are confined to the uniform-state case. If the broader applicability is part of the contribution, either a theoretical argument showing invariance to the noise process or at least one additional empirical demonstration on a non-uniform process is needed to support the claim.
  Authors: We agree that the generalization statement benefits from additional support. The PC framework in Section 3 is derived for general discrete diffusion processes without restricting the noise transition matrix. To address the concern, we have added a short theoretical argument in the revised Section 3.4 showing that the predictor-corrector updates preserve the required marginals for arbitrary noise processes. We have also included a small-scale empirical demonstration on a non-uniform noise process in the appendix. These changes substantiate the broader claim while preserving the manuscript's primary focus on uniform-state diffusion.
  Revision: yes
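The marginal-preservation property mentioned in the second response can be illustrated numerically. The sketch below is not the argument from the paper's Section 3.4. It assumes a simple Gibbs-style corrector (denoise with the exact posterior, then re-noise with the forward kernel) and checks that this corrector leaves the noisy marginal unchanged for a randomly generated, non-uniform forward kernel. All names are illustrative.

```python
import random

def random_distribution(k, rng):
    """Random probability vector with strictly positive entries."""
    w = [rng.random() + 0.1 for _ in range(k)]   # +0.1 keeps every entry positive
    z = sum(w)
    return [v / z for v in w]

def noisy_marginal(p_data, Q):
    """q_t(x_t) = sum over x_0 of p_data(x_0) * Q[x_0][x_t]."""
    k = len(p_data)
    return [sum(p_data[x0] * Q[x0][xt] for x0 in range(k)) for xt in range(k)]

def gibbs_corrector(p_data, Q):
    """Kernel C[x][y] = sum over x_0 of p(x_0 | x) * Q[x_0][y]."""
    k = len(p_data)
    q_t = noisy_marginal(p_data, Q)
    # Exact posterior p(x_0 | x_t = x) by Bayes' rule.
    post = [[p_data[x0] * Q[x0][x] / q_t[x] for x0 in range(k)] for x in range(k)]
    return [[sum(post[x][x0] * Q[x0][y] for x0 in range(k)) for y in range(k)]
            for x in range(k)]

def apply_kernel(dist, C):
    """Push a distribution through a row-stochastic kernel."""
    k = len(dist)
    return [sum(dist[x] * C[x][y] for x in range(k)) for y in range(k)]

rng = random.Random(0)
K = 5
p_data = random_distribution(K, rng)
Q = [random_distribution(K, rng) for _ in range(K)]   # arbitrary kernel q(x_t | x_0)
q_t = noisy_marginal(p_data, Q)
q_after = apply_kernel(q_t, gibbs_corrector(p_data, Q))
max_gap = max(abs(a - b) for a, b in zip(q_t, q_after))
print(f"max |q_t - q_t C| = {max_gap:.2e}")
```

The invariance here follows from using the exact posterior. With a learned denoiser the posterior is only approximate, so a check like this bounds what the construction can achieve rather than what a trained model delivers.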
Circularity Check
No significant circularity; empirical performance claims rest on independent sampler construction and evaluation.
The paper introduces Predictor-Corrector samplers that generalize prior methods for arbitrary noise processes in discrete diffusion, then reports direct empirical results: lower generative perplexity at matched unigram entropy on OpenWebText, improved FID/IS on CIFAR10, and continued gains with more steps when paired with uniform-state diffusion. These outcomes are presented as consequences of the new sampling procedure itself rather than quantities defined by fitted parameters or self-referential equations. The memory-efficient curriculum for Gaussian relaxation training is likewise framed as a practical reduction relative to the authors' prior Duo work while preserving comparable perplexity, without evidence that the reported metrics reduce to inputs by construction. No load-bearing self-citation chains, ansatz smuggling, or renaming of known results appear in the derivation of the samplers or the performance claims.
Forward citations
Cited by 2 Pith papers
- Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation
  Uniform diffusion models rely on a leave-one-out denoiser rather than the usual denoising posterior, with exact conversions derived; an absorbing-state reformulation is introduced that matches or exceeds masked diffus...
- How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
  Joint training of the latent space with the diffusion process produces a competitive latent diffusion language model that is faster than existing discrete and continuous diffusion baselines.