Improving Sampling for Masked Diffusion Models via Information Gain
Pith reviewed 2026-05-25 07:28 UTC · model grok-4.3
The pith
The Info-Gain Sampler selects tokens in masked diffusion models by how much they reduce uncertainty across the remaining sequence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By choosing the unmasking step that delivers the largest expected drop in joint uncertainty over the sequence, the Info-Gain Sampler produces higher-quality outputs than greedy local selection in masked diffusion models.
What carries the argument
The information-gain quantity, defined as the expected decrease in entropy over all remaining masked positions after conditioning on a candidate token.
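One way to write this quantity down, offered as a sketch rather than the paper's exact definition: let $z_t$ be the partially masked sequence at step $t$, $M_t$ its set of masked positions, and $p_\theta(\cdot \mid z_t, \ell)$ the predicted token distribution at position $\ell$. The symbols $H_j$, $a = (\ell, v)$, and $z_t \oplus a$ (the sequence with token $v$ written at position $\ell$) are notation assumed here for illustration.

```latex
H_j(z) = -\sum_{v} p_\theta(v \mid z, j)\,\log p_\theta(v \mid z, j)

\mathrm{IG}(a \mid z_t)
  = \sum_{j \in M_t \setminus \{\ell\}} H_j(z_t)
  \;-\; \sum_{j \in M_t \setminus \{\ell\}} H_j(z_t \oplus a),
  \qquad a = (\ell, v)
```

Summing per-position entropies is itself an assumption of this sketch. By subadditivity of entropy, that sum upper-bounds the joint entropy of the masked positions, which is why it can stand in for the joint uncertainty when the joint is intractable.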
If this is right
- Reasoning accuracy improves by 2.9 to 11.6 percentage points on average.
- Creative writing outputs achieve a 62.8 percent average win rate in pairwise comparisons against prior samplers.
- The sampler works without any task-specific training or fine-tuning.
- Improvements hold for coding and image generation in addition to text tasks.
Where Pith is reading between the lines
- The approach highlights the value of using bidirectional context to simulate future effects during decoding.
- Similar information-gain calculations might improve sampling in other non-autoregressive generation settings.
- Integrating uncertainty reduction into model training could further amplify the benefits observed at inference time.
- Performance may vary with the length of the sequence or the number of masked tokens at each step.
Load-bearing premise
The information-gain quantity computed from current bidirectional predictions serves as an accurate proxy for downstream generation quality without requiring task-specific validation or training.
What would settle it
A head-to-head comparison on a standard reasoning benchmark where the Info-Gain Sampler produces lower accuracy than the greedy baseline.
Original abstract
Masked Diffusion Models (MDMs) enable flexible decoding orders, yet existing samplers remain largely greedy, selecting locally certain tokens without accounting for their downstream effects. We show that this myopia can increase cumulative uncertainty and lead to suboptimal generation. To address this, we propose the **Info-Gain Sampler**, a training-free decoding method that uses the bidirectional structure of MDMs to balance immediate uncertainty with the information gained over remaining masked positions. Across reasoning, coding, creative writing, and image generation tasks, Info-Gain Sampler consistently outperforms existing MDM samplers, improving average reasoning accuracy by 2.9--11.6 percentage points and achieving a 62.8% average win rate in creative writing. The code is available at https://github.com/yks23/Information-Gain-Sampler.
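The contrast the abstract draws between greedy and information-gain selection can be illustrated on a toy problem. The sketch below is not the authors' implementation (the linked repository is the reference for that). It replaces the MDM with an exact toy joint distribution over three binary tokens, and every name in it (`marginals`, `info_gain`, `greedy_position`, `info_gain_position`) is hypothetical. It also ranks positions by expected gain alone, whereas the abstract describes balancing gain against immediate uncertainty.

```python
import math

MASK = None  # marker for a masked position

def marginals(joint, z):
    """Per-position conditional marginals of a toy joint, given revealed tokens in z."""
    # Keep only full sequences consistent with the tokens already revealed.
    consistent = {s: p for s, p in joint.items()
                  if all(t is MASK or t == s[i] for i, t in enumerate(z))}
    total = sum(consistent.values())
    out = {}
    for i, t in enumerate(z):
        if t is MASK:
            dist = {}
            for s, p in consistent.items():
                dist[s[i]] = dist.get(s[i], 0.0) + p / total
            out[i] = dist
    return out

def entropy(dist):
    """Shannon entropy (in nats) of a {token: probability} mapping."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def info_gain(joint, z, pos, tok):
    """Drop in summed entropy over the other masked positions after fixing z[pos] = tok."""
    before = marginals(joint, z)
    z_next = list(z)
    z_next[pos] = tok
    after = marginals(joint, z_next)
    others = [j for j in before if j != pos]
    return (sum(entropy(before[j]) for j in others)
            - sum(entropy(after[j]) for j in others))

def greedy_position(joint, z):
    """Greedy baseline: unmask the position whose own prediction is most certain."""
    m = marginals(joint, z)
    return min(m, key=lambda i: entropy(m[i]))

def info_gain_position(joint, z):
    """Unmask the position with the largest expected information gain."""
    m = marginals(joint, z)
    def expected_gain(i):
        return sum(p * info_gain(joint, z, i, tok) for tok, p in m[i].items())
    return max(m, key=expected_gain)

# Toy joint (an assumption for illustration): token 0 is nearly deterministic
# and independent of the rest; tokens 1 and 2 are uniform but always equal.
joint = {(x0, x, x): (0.9 if x0 == 0 else 0.1) * 0.5
         for x0 in (0, 1) for x in (0, 1)}
z = [MASK, MASK, MASK]

print("greedy picks position", greedy_position(joint, z))
print("info-gain picks position", info_gain_position(joint, z))
```

In this toy, the greedy rule unmasks position 0 because it is the most certain, yet doing so tells it nothing about the other two. The gain-based rule prefers one of the two coupled positions, since revealing either one resolves the other.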
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Info-Gain Sampler, a training-free decoding method for Masked Diffusion Models (MDMs) that exploits their bidirectional structure to select tokens maximizing expected information gain over remaining masked positions, rather than relying on local certainty as in greedy samplers. The authors claim this reduces cumulative uncertainty and yields consistent improvements, including 2.9--11.6 percentage point gains in average reasoning accuracy and a 62.8% average win rate in creative writing tasks, with code released publicly.
Significance. If the results hold after proper validation, the work would offer a practical, training-free enhancement to MDM sampling by incorporating global considerations into token selection, with potential applicability across language and image generation. The public release of code is a clear strength supporting reproducibility.
major comments (2)
- [Abstract] The central performance claims (2.9--11.6 pp accuracy gains, 62.8% win rate) are presented without implementation details, baseline comparisons, statistical tests, number of runs, or controls. This is load-bearing because it prevents verifying that the reported improvements arise from the information-gain mechanism rather than from other sampler properties.
- [Abstract] No direct evidence is supplied that the information-gain quantity (computed from current bidirectional predictions) correlates with final output quality, for example per-step correlation plots, ablation studies removing the gain term, or task-specific validation. This proxy assumption is load-bearing for the method's motivation and for the claim of consistent outperformance.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for highlighting areas where the abstract could better support verification of our claims. We address the two major comments point by point below, offering targeted revisions where appropriate while noting that the full experimental details and ablations already appear in the manuscript body.
- Referee: [Abstract] The central performance claims (2.9--11.6 pp accuracy gains, 62.8% win rate) are presented without implementation details, baseline comparisons, statistical tests, number of runs, or controls. This is load-bearing because it prevents verifying that the reported improvements arise from the information-gain mechanism rather than from other sampler properties.
  Authors: The abstract is a high-level summary by design. Full implementation details (including the bidirectional prediction mechanism), baseline comparisons (greedy, random, and other MDM samplers), statistical tests, number of runs (5 seeds per task), and controls are provided in Sections 3–5. To make the abstract self-contained for quick verification, we will revise it to include a short clause noting the controlled experimental protocol and that gains are isolated via ablations on the information-gain term. This addresses the concern without exceeding typical abstract length.
  revision: yes
- Referee: [Abstract] No direct evidence is supplied that the information-gain quantity (computed from current bidirectional predictions) correlates with final output quality, for example per-step correlation plots, ablation studies removing the gain term, or task-specific validation. This proxy assumption is load-bearing for the method's motivation and for the claim of consistent outperformance.
  Authors: We agree that explicit validation of the information-gain proxy strengthens the motivation. The manuscript already contains ablation studies (Section 4.2) that remove the gain term and demonstrate degraded performance, plus task-specific results across reasoning, coding, and creative writing. However, per-step correlation plots between information gain and downstream quality metrics are not present. We will add these plots and a brief task-specific validation subsection in the revision to directly link the quantity to output quality.
  revision: yes
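The per-step check discussed in this response could start from something as simple as a correlation between the information gain recorded at each step and a quality score for the resulting output. A minimal sketch, assuming paired lists of such measurements are available; the function name `pearson` and the example values are placeholders, not data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)

# Placeholder values for illustration only (not results from the paper).
gains = [0.1, 0.4, 0.2, 0.7]
quality = [0.2, 0.5, 0.3, 0.9]
r = pearson(gains, quality)
print(f"correlation between gain and quality: {r:.3f}")
```

A rank-based coefficient such as Spearman's may be a better fit if the relationship is monotone but not linear; Pearson is used here only because it is short to write out.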
Circularity Check
No circularity; Info-Gain Sampler defined from MDM structure, results empirical
The paper defines the Info-Gain Sampler directly from the bidirectional token predictions already present in MDMs, selecting tokens to maximize expected uncertainty reduction over remaining masks. No load-bearing step reduces the sampler definition, its selection rule, or the reported accuracy/win-rate gains to a fitted parameter, self-citation chain, or renaming of an input quantity. Performance numbers are presented as downstream experimental outcomes, not as quantities recovered by construction from the method itself. The assumption that information gain serves as a useful proxy is an empirical hypothesis, not a definitional tautology.
A. Formal Definitions of Baseline Samplers
To provide a rigorous comparison, we formalize the action selection mechanism for each baseline sampler. At each decoding step t, let p_θ(· | z_t, ℓ) denote the predicted token distribution at masked position ℓ ∈ M_t. The samplers differ in their...

The decoding budget K (tokens per step) is varied between 1 and 2 to evaluate performance under different acceleration ratios. Maximum generation lengths are benchmark-specific: 256 tokens for GSM8K, HumanEval, and MBPP; 512 tokens for MATH500 and SDAR benchmarks; and 1024 tokens for the TraDo-8B model to accommodate longer reasoning chains. Text-to-Image...