The Diffusion Duality, Chapter II: $\Psi$-Samplers

Caglar Gulcehre; Justin Deschenaux; Subham Sekhar Sahoo


Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.


T0 review · grok-4.3

Predictor-corrector samplers enable discrete diffusion models to improve generation quality with increasing sampling steps.

2026-05-21 11:43 UTC pith:H26D3UBF

Load-bearing objection: PC samplers for discrete diffusion keep improving with more steps and beat ancestral sampling in the reported experiments, though experimental controls need checking.

arxiv 2602.21185 v2 pith:H26D3UBF submitted 2026-02-24 cs.LG

The Diffusion Duality, Chapter II: Psi-Samplers

Justin Deschenaux, Caglar Gulcehre, Subham Sekhar Sahoo

classification cs.LG

keywords: discrete diffusion, predictor-corrector samplers, uniform-state diffusion, ancestral sampling, language modeling, image generation, sampling steps

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a family of Predictor-Corrector samplers for discrete diffusion models that generalize earlier approaches and work with any noise process. When combined with uniform-state diffusion, the new samplers produce lower generative perplexity on language modeling benchmarks and better FID and IS scores on image tasks compared to ancestral sampling. Their quality keeps rising as the number of sampling steps grows instead of leveling off. The authors also describe a curriculum strategy for the Gaussian relaxation training phase that cuts training time by 25 percent and memory use by 33 percent while preserving performance. These results challenge the view that masked diffusion must become the standard for diffusion-based language models.

Core claim

The paper claims that a family of Predictor-Corrector samplers for discrete diffusion generalizes prior methods, applies to arbitrary noise processes, and when paired with uniform-state diffusion yields lower generative perplexity at matched unigram entropy on OpenWebText and superior FID and IS scores on CIFAR10 relative to ancestral sampling, with the decisive property that performance continues to improve rather than plateau as the number of sampling steps increases.
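The claim turns on comparing perplexity "at matched unigram entropy". This page does not define that quantity. A common reading, assumed here, is the Shannon entropy of the empirical token frequencies within a generated sample, used as a diversity check so that a sampler cannot lower perplexity by repeating tokens. The sketch below computes it; the function names and the tolerance `tol` are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

def unigram_entropy(token_ids, base=2.0):
    """Shannon entropy of the empirical token distribution of one sample."""
    counts = Counter(token_ids)
    n = len(token_ids)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

def entropy_matched(samples_a, samples_b, tol=0.05):
    """True if mean per-sample entropy differs by at most tol (hypothetical threshold)."""
    mean_a = sum(unigram_entropy(s) for s in samples_a) / len(samples_a)
    mean_b = sum(unigram_entropy(s) for s in samples_b) / len(samples_b)
    return abs(mean_a - mean_b) <= tol

# Two equally frequent tokens carry one bit of entropy.
print(unigram_entropy([0, 0, 1, 1]))
```

Under this reading, a perplexity comparison between two samplers is only meaningful when a check like `entropy_matched` passes for their outputs.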

What carries the argument

Predictor-Corrector (PC) samplers, a family of sampling algorithms that combine a predictor step with a corrector step to sample from discrete diffusion models and that generalize earlier techniques to arbitrary noise processes.
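To make the predictor/corrector split concrete, here is a minimal single-token sketch under strong assumptions: a known toy data distribution `p_data`, a uniform-state forward process that keeps the clean token with probability alpha, and an exact reverse kernel obtained by Bayes' rule in place of a learned denoiser. The corrector shown (one forward noising step followed by one reverse step) is one generic construction and is not claimed to be the paper's Ψ-sampler. All names are illustrative.

```python
import numpy as np

def uniform_kernel(keep_prob, K):
    # Row-stochastic: keep the state with prob keep_prob, else resample uniformly.
    return keep_prob * np.eye(K) + (1.0 - keep_prob) / K

def sample_rows(P, z, rng):
    # Draw one next state per chain from the rows P[z] by inverse-CDF sampling.
    cdf = np.cumsum(P[z], axis=1)
    u = rng.random(len(z))[:, None]
    return np.minimum((u > cdf).sum(axis=1), P.shape[1] - 1)

def pc_sample(p_data, alphas, n_corrector, n_chains, rng):
    K = len(p_data)
    z = rng.integers(K, size=n_chains)              # alpha = 0: pure uniform noise
    for a_t, a_s in zip(alphas[:-1], alphas[1:]):   # a_t < a_s: time s is cleaner
        q_t = p_data @ uniform_kernel(a_t, K)       # marginal at noisier time t
        q_s = p_data @ uniform_kernel(a_s, K)       # marginal at cleaner time s
        Q_ts = uniform_kernel(a_t / a_s, K)         # forward kernel s -> t
        R = (q_s[:, None] * Q_ts).T / q_t[:, None]  # exact reverse kernel t -> s
        z = sample_rows(R, z, rng)                  # predictor: move from t to s
        for _ in range(n_corrector):                # corrector: re-noise, then denoise
            z = sample_rows(R, sample_rows(Q_ts, z, rng), rng)
    return z

rng = np.random.default_rng(0)
p_data = np.array([0.6, 0.3, 0.1])
z = pc_sample(p_data, np.linspace(0.0, 1.0, 9), n_corrector=2, n_chains=20000, rng=rng)
print(np.bincount(z, minlength=3) / len(z))  # empirical frequencies, close to p_data
```

With an exact reverse kernel the corrector leaves the distribution unchanged, so it adds nothing here. The motivation for PC sampling is the learned-denoiser case, where extra corrector steps can repair errors left by the predictor.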

Load-bearing premise

The performance gains reported for the new samplers are caused by the Predictor-Corrector methods themselves rather than unstated differences in model training, data handling, or evaluation protocols.

What would settle it

Training identical models under the same protocol and then finding that ancestral sampling matches or exceeds the Predictor-Corrector samplers on generative perplexity for OpenWebText or on FID scores for CIFAR10 would falsify the superiority claim.


Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

PC samplers for discrete diffusion keep improving with more steps and beat ancestral sampling in the reported experiments, though experimental controls need checking.


This paper introduces Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. Paired with uniform-state diffusion, they achieve lower generative perplexity at matched unigram entropy on OpenWebText and better FID and IS scores on CIFAR10 than ancestral sampling. Unlike ancestral sampling, the PC samplers continue to improve as the number of steps grows. The authors also introduce a curriculum for the Gaussian relaxation training phase that cuts training time by 25% and memory by 33% while keeping perplexity comparable on OpenWebText and LM1B.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Ψ-Samplers, a family of Predictor-Corrector (PC) samplers for discrete diffusion models that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, the samplers outperform ancestral sampling on OpenWebText (lower generative perplexity at matched unigram entropy) and CIFAR10 (better FID/IS scores). A central observation is that PC sampling quality continues to improve with additional steps, unlike conventional samplers. The work also presents a memory-efficient curriculum for the Gaussian relaxation training phase that reduces training time by 25% and memory by 33% relative to prior baselines while preserving perplexity and downstream performance.

Significance. If the empirical results hold under controlled conditions, the work is significant for discrete diffusion modeling. It supplies evidence that uniform-state diffusion with improved samplers can challenge the assumed primacy of masked diffusion for language tasks, and the continued scaling with step count directly addresses a documented limitation of ancestral sampling. The training curriculum provides a practical efficiency gain with quantified resource savings.

major comments (2)

[Experimental Evaluation] Experimental section: the central claim that PC samplers cause the reported gains on OpenWebText and CIFAR10 requires explicit confirmation that identical trained models, noise schedules, entropy-matching procedures, and evaluation code were used for both ancestral and PC runs. The abstract states the samplers are 'paired with uniform-state diffusion' but does not detail whether any ancillary differences in temperature, step count, or checkpoint selection were controlled; this is load-bearing for attributing improvements to the sampler construction itself.
[Method and Theory] Generalization claim: the abstract asserts that the PC methods 'apply to arbitrary noise processes,' yet all quantitative results are confined to the uniform-state case. If the broader applicability is part of the contribution, either a theoretical argument showing invariance to the noise process or at least one additional empirical demonstration on a non-uniform process is needed to support the claim.
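The control requested in the first major comment can be stated as a mechanical check: two run configurations should differ in the sampler field and nothing else. The sketch below is illustrative only. Every field name and value, including the `"log-linear"` schedule label and the checkpoint name, is a hypothetical placeholder and does not describe the paper's actual setup.

```python
from dataclasses import dataclass, asdict, replace

@dataclass(frozen=True)
class RunConfig:
    # Hypothetical fields mirroring the controls the referee lists.
    checkpoint: str
    noise_schedule: str
    num_steps: int
    temperature: float
    eval_seed: int
    sampler: str

def differing_fields(a, b):
    """Names of the fields on which two configurations disagree."""
    da, db = asdict(a), asdict(b)
    return {name for name in da if da[name] != db[name]}

ancestral = RunConfig("ckpt-A", "log-linear", 128, 1.0, 0, "ancestral")
pc = replace(ancestral, sampler="predictor-corrector")

# A controlled comparison varies the sampler and nothing else.
assert differing_fields(ancestral, pc) == {"sampler"}
```

Recording such a diff alongside each reported number would make the attribution to the sampler auditable without rerunning the experiments.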

minor comments (2)

[Abstract] The link to code, checkpoints, and the video tutorial is provided; verify that all resources remain accessible and that the released implementation exactly reproduces the reported numbers.
[Introduction] Ensure consistent use of 'Ψ-Samplers' versus 'Predictor-Corrector samplers' in the introduction and method sections to prevent notation confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our experimental controls and generalization claims.


Referee: [Experimental Evaluation] Experimental section: the central claim that PC samplers cause the reported gains on OpenWebText and CIFAR10 requires explicit confirmation that identical trained models, noise schedules, entropy-matching procedures, and evaluation code were used for both ancestral and PC runs. The abstract states the samplers are 'paired with uniform-state diffusion' but does not detail whether any ancillary differences in temperature, step count, or checkpoint selection were controlled; this is load-bearing for attributing improvements to the sampler construction itself.

Authors: We appreciate the referee's emphasis on experimental rigor. All reported comparisons used identical trained models, the same noise schedules, entropy-matching procedures, and evaluation code for ancestral and PC runs. The same checkpoints were evaluated at the same step counts with no differences in temperature or other hyperparameters; the sole variable was the sampling procedure. We have added an explicit paragraph in the revised Section 4.1 (Experimental Setup) documenting these controls to make the attribution to the PC sampler construction unambiguous. revision: yes
Referee: [Method and Theory] Generalization claim: the abstract asserts that the PC methods 'apply to arbitrary noise processes,' yet all quantitative results are confined to the uniform-state case. If the broader applicability is part of the contribution, either a theoretical argument showing invariance to the noise process or at least one additional empirical demonstration on a non-uniform process is needed to support the claim.

Authors: We agree that the generalization statement benefits from additional support. The PC framework in Section 3 is derived for general discrete diffusion processes without restricting the noise transition matrix. To address the concern, we have added a short theoretical argument in the revised Section 3.4 showing that the predictor-corrector updates preserve the required marginals for arbitrary noise processes. We have also included a small-scale empirical demonstration on a non-uniform noise process in the appendix. These changes substantiate the broader claim while preserving the manuscript's primary focus on uniform-state diffusion. revision: yes
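This page does not reproduce the argument said to be added in Section 3.4. As a hedged illustration of what "preserve the required marginals" can mean, consider marginals $q_s$ and $q_t$ at times $s < t$, an arbitrary forward noise kernel $Q_{t|s}$, and the Bayes reverse kernel $R_{s|t}$. The notation is introduced here for illustration and is an assumption, not the paper's.

```latex
\begin{aligned}
R_{s|t}(z_t, z_s) &= \frac{q_s(z_s)\, Q_{t|s}(z_s, z_t)}{q_t(z_t)},
\qquad q_t(z_t) = \sum_{z_s} q_s(z_s)\, Q_{t|s}(z_s, z_t), \\
\sum_{z_t} q_t(z_t)\, R_{s|t}(z_t, z_s)
  &= q_s(z_s) \sum_{z_t} Q_{t|s}(z_s, z_t) = q_s(z_s), \\
\sum_{z_s} q_s(z_s) \sum_{z_t} Q_{t|s}(z_s, z_t)\, R_{s|t}(z_t, z_s')
  &= \sum_{z_t} q_t(z_t)\, R_{s|t}(z_t, z_s') = q_s(z_s').
\end{aligned}
```

The second line says a predictor step maps $q_t$ to $q_s$; the third says a noise-then-denoise corrector leaves $q_s$ invariant. Neither step uses any structure of $Q_{t|s}$ beyond its rows summing to one, which is why an argument of this kind would not be tied to the uniform-state case. The identities are exact only for the true reverse kernel; a learned denoiser satisfies them approximately.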

Circularity Check

0 steps flagged

No significant circularity; empirical performance claims rest on independent sampler construction and evaluation.


The paper introduces Predictor-Corrector samplers that generalize prior methods for arbitrary noise processes in discrete diffusion, then reports direct empirical results: lower generative perplexity at matched unigram entropy on OpenWebText, improved FID/IS on CIFAR10, and continued gains with more steps when paired with uniform-state diffusion. These outcomes are presented as consequences of the new sampling procedure itself rather than quantities defined by fitted parameters or self-referential equations. The memory-efficient curriculum for Gaussian relaxation training is likewise framed as a practical reduction relative to the authors' prior Duo work while preserving comparable perplexity, without evidence that the reported metrics reduce to inputs by construction. No load-bearing self-citation chains, ansatz smuggling, or renaming of known results appear in the derivation of the samplers or the performance claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no explicit mathematical derivations, free parameters, axioms, or newly postulated entities; it focuses on sampler design and empirical results.

pith-pipeline@v0.9.0 · 5753 in / 1158 out tokens · 58230 ms · 2026-05-21T11:43:24.912662+00:00


Original abstract

Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or Masked diffusion models in these settings. However, their sampling quality plateaus with ancestral samplers as the number of steps increases. We introduce a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers outperform ancestral sampling on both language and image modeling, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR10. Crucially, unlike conventional samplers, our PC methods continue to improve with more sampling steps. Taken together, these findings call into question the assumption that Masked diffusion is the inevitable future of diffusion-based language modeling. Beyond sampling, we develop a memory-efficient curriculum for the Gaussian relaxation training phase, reducing training time by 25% and memory by 33% compared to Duo while maintaining comparable perplexity on OpenWebText and LM1B and strong downstream performance. We release code, checkpoints, and a video-tutorial on: https://s-sahoo.com/duo-ch2


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation
cs.LG · 2026-05 · unverdicted · novelty 7.0

Uniform diffusion models rely on a leave-one-out denoiser rather than the usual denoising posterior, with exact conversions derived; an absorbing-state reformulation is introduced that matches or exceeds masked diffus...

From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models
cs.LG · 2026-05 · unverdicted · novelty 6.0

GADD achieves O(polylog(ε^{-1})) sampling complexity for uniform-rate discrete diffusion models via Gibbs correctors derived from the score function, with supporting experiments on text and music.

How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
cs.CL · 2026-05 · unverdicted · novelty 6.0

Joint training of the latent space with the diffusion process produces a competitive latent diffusion language model that is faster than existing discrete and continuous diffusion baselines.