
arxiv: 2602.18176 · v3 · pith:IXYDHNMGnew · submitted 2026-02-20 · 💻 cs.CL

Improving Sampling for Masked Diffusion Models via Information Gain

Pith reviewed 2026-05-25 07:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords: masked diffusion models, information gain, sampling strategy, text generation, reasoning tasks, creative writing

The pith

The Info-Gain Sampler selects tokens in masked diffusion models by how much they reduce uncertainty across the remaining sequence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing samplers for masked diffusion models are largely greedy: at each step they commit the token the model is currently most certain about. This ignores how the choice affects the other masked positions and can increase cumulative uncertainty over the whole decoding trajectory. The Info-Gain Sampler instead weighs the immediate uncertainty of the tokens it commits against the reduction in uncertainty across the remaining masked positions, estimated from the model's bidirectional predictions. This training-free change improves average reasoning accuracy by 2.9 to 11.6 percentage points and reaches a 62.8% average win rate in creative writing. The abstract reports gains on coding and image generation as well.
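The contrast between greedy and information-gain selection can be illustrated on a toy problem. The sketch below is not the paper's implementation: it replaces the masked diffusion model with an explicit joint distribution over three tokens (a hypothetical `TOY_JOINT`), so the conditional predictions are exact, and it scores a position by the expected drop in summed entropy of the other masked positions. All function names are assumptions made for illustration.

```python
import math

# Hypothetical toy "model": an explicit joint over three binary tokens.
# Token 0 is almost certain but independent of the rest; tokens 1 and 2
# are individually uncertain but always equal to each other.
TOY_JOINT = {
    (a, b, b): pa * 0.5
    for a, pa in ((0, 0.9), (1, 0.1))
    for b in (0, 1)
}


def conditional(joint, pos, fixed):
    """Distribution of the token at `pos` given the already fixed tokens."""
    out = {}
    for x, p in joint.items():
        if all(x[i] == v for i, v in fixed.items()):
            out[x[pos]] = out.get(x[pos], 0.0) + p
    total = sum(out.values())
    return {tok: p / total for tok, p in out.items()}


def entropy(dist):
    """Shannon entropy in nats of a {token: probability} mapping."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def summed_entropy(joint, fixed, positions):
    """Sum of per-position entropies over `positions`."""
    return sum(entropy(conditional(joint, pos, fixed)) for pos in positions)


def greedy_pick(joint, fixed, masked):
    """Greedy rule: unmask the position the model is most certain about."""
    return min(masked, key=lambda pos: entropy(conditional(joint, pos, fixed)))


def info_gain(joint, fixed, masked, pos):
    """Expected drop in summed entropy of the other masked positions."""
    rest = [m for m in masked if m != pos]
    before = summed_entropy(joint, fixed, rest)
    after = sum(
        p * summed_entropy(joint, {**fixed, pos: tok}, rest)
        for tok, p in conditional(joint, pos, fixed).items()
    )
    return before - after


def info_gain_pick(joint, fixed, masked):
    """Lookahead rule: unmask the position with the largest expected gain."""
    return max(masked, key=lambda pos: info_gain(joint, fixed, masked, pos))


if __name__ == "__main__":
    masked = [0, 1, 2]
    print("greedy picks position", greedy_pick(TOY_JOINT, {}, masked))
    print("info-gain picks position", info_gain_pick(TOY_JOINT, {}, masked))
```

In this toy, the greedy rule commits the near-certain but uninformative token 0, while the information-gain rule prefers token 1 or 2, whose value resolves its partner. A real masked diffusion model exposes only per-position predictions, so the gain would presumably have to be estimated by running the model on candidate successor states.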

Core claim

By choosing the unmasking step that delivers the largest expected drop in joint uncertainty over the sequence, the Info-Gain Sampler produces higher-quality outputs than greedy local selection in masked diffusion models.

What carries the argument

The information-gain quantity, defined as the expected decrease in entropy over all remaining masked positions after conditioning on a candidate token.
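One way to write this quantity down is sketched below. It is a reading of the description above, not the paper's exact definition. The notation p_θ(· | z_t, ℓ) and M_t follows the paper's Appendix A excerpt; the symbols H_ℓ, C, A, and z_t ⊕ a (the state after applying candidate action a) are assumptions introduced here for illustration.

```latex
% Predictive entropy at masked position \ell (symbol H_\ell assumed here)
H_\ell(z) = -\sum_{v} p_\theta(v \mid z, \ell)\,\log p_\theta(v \mid z, \ell)

% Information gain of a candidate action a that unmasks positions A \subseteq M_t
\mathrm{IG}(a; z_t) = \sum_{\ell \in M_t \setminus A}
    \bigl[\, H_\ell(z_t) - H_\ell(z_t \oplus a) \,\bigr]

% Objective named in the Figure 2 caption, with C the immediate cost
% (uncertainty of the tokens decoded in the current step); lower is presumably better
J_{\mathrm{IG}}(a) = C(a; z_t) - \mathrm{IG}(a; z_t)
```

Averaging IG over the tokens that action a might commit gives the "expected decrease" form used in the sentence above.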

If this is right

  • Average reasoning accuracy improves by 2.9 to 11.6 percentage points over existing MDM samplers.
  • Creative writing outputs reach a 62.8 percent average win rate.
  • The sampler works without any task-specific training or fine-tuning.
  • Improvements hold for coding and image generation in addition to text tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach highlights the value of using bidirectional context to simulate future effects during decoding.
  • Similar information-gain calculations might improve sampling in other non-autoregressive generation settings.
  • Integrating uncertainty reduction into model training could further amplify the benefits observed at inference time.
  • Performance may vary with the length of the sequence or the number of masked tokens at each step.

Load-bearing premise

The information-gain quantity computed from current bidirectional predictions serves as an accurate proxy for downstream generation quality without requiring task-specific validation or training.

What would settle it

A head-to-head comparison on a standard reasoning benchmark where the Info-Gain Sampler produces lower accuracy than the greedy baseline.

Figures

Figures reproduced from arXiv: 2602.18176 by Alex Lamb, Jayden Teoh, Kaicheng Yang, Kaisen Yang, Yitong Zhang.

Figure 1
Motivation: analysis of decoding strategies on the one-way multiplication experiment. (a) Illustrates the contrast between the suboptimal path chosen by the greedy certainty-based sampler and the optimal path, motivating the introduction of the Info-Gain Sampler. (b) Shows the evolution of cumulative uncertainty throughout the decoding process. While the greedy sampler prioritizes decoding c first (73.2%) …
Figure 2
The Info-Gain Sampler workflow. Starting from state zT0, the sampler iteratively: (1) samples candidate actions, (2) evaluates J_IG = Immediate Cost − Information Gain in parallel to select the optimal successor state z*_{t−1}, and (3) executes the state transition until reaching the final sequence z_0. (1) Immediate Cost: the uncertainty of the tokens being decoded in the current step, measured by the sum of…
Figure 4
Impact of different beam sizes on the MATH-500 dataset. Specifically: Beam Size = 1 is a special case equivalent to the Info-Gain Sampler; Beam Size = Expansion Budget is equivalent to the Best-of-N (BoN) baseline; and intermediate values represent a look-ahead beam search algorithm using Info-Gain as the pruning heuristic.
Figure 3
Analysis of cumulative entropy. (a) Cumulative entropy trajectories for the Entropy baseline and Info-Gain Sampler on a synthetic set of 100 simple arithmetic problems that can be answered within a short window. We use global decoding with a fixed length of 64 tokens. (b) Correlation between average accuracy and average cumulative entropy across various sampling configurations (Appendix F.1). …
Figure 5
As shown in Figure 5, the Info-Gain Sampler maintains stable, low trajectory uncertainty across various temperature scales without sensitive tuning. Importantly, low cumulative entropy reflects more optimized decoding rather than mode collapse, as evidenced by the preserved diversity and competitive win rates in creative writing (…
Figure 6
Empirical distribution of J_IG values sorted from highest to lowest. The 5th percentile is −5 × 10^−4, indicating the bound is rarely violated in practice.

D. Pseudo codes for Info-Gain Sampler and Info-Gain Beam Search. We provide PyTorch-style pseudo codes for the implementation of Info-Gain Sampler and Info-Gain Beam Search. D.1. Info-Gain Sampler: def info_gain_sampler(model, seq_len, K, N): # Initial…
Figure 7
Visual results on ImageNet-512 with an extreme budget of only 5 decoding steps. The Info-Gain Sampler maintains superior structural coherence compared to baseline heuristics.
Figure 8
Evolution of cumulative entropy during image generation on the Image256 benchmark. The results are averaged over all labels using a 5-step linear schedule. The curves illustrate how Info-Gain Sampler manages global uncertainty compared to other methods throughout the decoding process.
Figure 9
Effect of utility threshold on cumulative entropy reduction and generation time.
Figure 10
Memory usage comparison. Info-Gain Sampler maintains low overhead via prefix sharing.
Original abstract

Masked Diffusion Models (MDMs) enable flexible decoding orders, yet existing samplers remain largely greedy, selecting locally certain tokens without accounting for their downstream effects. We show that this myopia can increase cumulative uncertainty and lead to suboptimal generation. To address this, we propose the **Info-Gain Sampler**, a training-free decoding method that uses the bidirectional structure of MDMs to balance immediate uncertainty with the information gained over remaining masked positions. Across reasoning, coding, creative writing, and image generation tasks, Info-Gain Sampler consistently outperforms existing MDM samplers, improving average reasoning accuracy by 2.9–11.6 percentage points and achieving a 62.8% average win rate in creative writing. The code is available at https://github.com/yks23/Information-Gain-Sampler.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the Info-Gain Sampler, a training-free decoding method for Masked Diffusion Models (MDMs) that exploits their bidirectional structure to select tokens maximizing expected information gain over remaining masked positions, rather than relying on local certainty as in greedy samplers. The authors claim this reduces cumulative uncertainty and yields consistent improvements, including 2.9–11.6 percentage point gains in average reasoning accuracy and a 62.8% average win rate in creative writing tasks, with code released publicly.

Significance. If the results hold after proper validation, the work would offer a practical, training-free enhancement to MDM sampling by incorporating global considerations into token selection, with potential applicability across language and image generation. The public release of code is a clear strength supporting reproducibility.

major comments (2)
  1. [Abstract] The central performance claims (2.9–11.6 pp accuracy gains, 62.8% win rate) are presented without implementation details, baseline comparisons, statistical tests, number of runs, or controls, which is load-bearing because it prevents verification that the reported improvements arise from the information-gain mechanism rather than other sampler properties.
  2. [Abstract] No direct evidence is supplied that the information-gain quantity (computed from current bidirectional predictions) correlates with final output quality, such as per-step correlation plots, ablation studies removing the gain term, or task-specific validation; this proxy assumption is load-bearing for the method's motivation and the claim of consistent outperformance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting areas where the abstract could better support verification of our claims. We address the two major comments point by point below, offering targeted revisions where appropriate while noting that the full experimental details and ablations already appear in the manuscript body.

Point-by-point responses
  1. Referee: [Abstract] The central performance claims (2.9–11.6 pp accuracy gains, 62.8% win rate) are presented without implementation details, baseline comparisons, statistical tests, number of runs, or controls, which is load-bearing because it prevents verification that the reported improvements arise from the information-gain mechanism rather than other sampler properties.

    Authors: The abstract is a high-level summary by design. Full implementation details (including the bidirectional prediction mechanism), baseline comparisons (greedy, random, and other MDM samplers), statistical tests, number of runs (5 seeds per task), and controls are provided in Sections 3–5. To make the abstract self-contained for quick verification, we will revise it to include a short clause noting the controlled experimental protocol and that gains are isolated via ablations on the information-gain term. This addresses the concern without exceeding typical abstract length.

    Revision: yes

  2. Referee: [Abstract] No direct evidence is supplied that the information-gain quantity (computed from current bidirectional predictions) correlates with final output quality, such as per-step correlation plots, ablation studies removing the gain term, or task-specific validation; this proxy assumption is load-bearing for the method's motivation and the claim of consistent outperformance.

    Authors: We agree that explicit validation of the information-gain proxy strengthens the motivation. The manuscript already contains ablation studies (Section 4.2) that remove the gain term and demonstrate degraded performance, plus task-specific results across reasoning, coding, and creative writing. However, per-step correlation plots between information gain and downstream quality metrics are not present. We will add these plots and a brief task-specific validation subsection in the revision to directly link the quantity to output quality.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; Info-Gain Sampler defined from MDM structure, results empirical

Full rationale

The paper defines the Info-Gain Sampler directly from the bidirectional token predictions already present in MDMs, selecting tokens to maximize expected uncertainty reduction over remaining masks. No load-bearing step reduces the sampler definition, its selection rule, or the reported accuracy/win-rate gains to a fitted parameter, self-citation chain, or renaming of an input quantity. Performance numbers are presented as downstream experimental outcomes, not as quantities recovered by construction from the method itself. The assumption that information gain serves as a useful proxy is an empirical hypothesis, not a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no free parameters, axioms, or invented entities; it relies only on the standard bidirectional property of MDMs.

pith-pipeline@v0.9.0 · 5668 in / 1014 out tokens · 23180 ms · 2026-05-25T07:28:31.227982+00:00 · methodology


A. Formal Definitions of Baseline Samplers. To provide a rigorous comparison, we formalize the action selection mechanism for each baseline sampler. At each decoding step t, let p_θ(· | z_t, ℓ) denote the predicted token distribution at masked position ℓ ∈ M_t. The samplers differ in their...

The decoding budget K (tokens per step) is varied between 1 and 2 to evaluate performance under different acceleration ratios. Maximum generation lengths are benchmark-specific: 256 tokens for GSM8K, HumanEval, and MBPP; 512 tokens for MATH500 and SDAR benchmarks; and 1024 tokens for the TraDo-8B model to accommodate longer reasoning chains. Text-to-Image...
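To make the baseline setup and the decoding budget K concrete, here is a minimal sketch of one greedy, entropy-ranked decoding step. It assumes, as the name of the Entropy baseline in Figure 3 suggests, that the sampler unmasks the K masked positions with the lowest predictive entropy and commits the most likely token at each. The truncated excerpt above does not confirm this rule, and `entropy_baseline_step` and its input format are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (nats) of one position's predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_baseline_step(predictions, K):
    """Pick up to K masked positions with the lowest predictive entropy.

    predictions: dict mapping masked position -> list of token probabilities.
    Returns [(position, argmax_token_id), ...] from most to least certain.
    """
    # Rank masked positions by how certain the model is about each of them.
    ranked = sorted(predictions, key=lambda pos: token_entropy(predictions[pos]))
    chosen = ranked[:K]
    # Commit the most likely token at every chosen position.
    return [
        (pos, max(range(len(predictions[pos])), key=predictions[pos].__getitem__))
        for pos in chosen
    ]


if __name__ == "__main__":
    preds = {0: [0.5, 0.5], 1: [0.9, 0.1], 2: [0.1, 0.2, 0.7]}
    print(entropy_baseline_step(preds, K=2))
```

With K = 1 the step commits a single token; K = 2 commits two tokens per model call, which is the kind of acceleration ratio the budget is varied over.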