arxiv: 2605.22939 · v1 · pith:UFUPQDYUnew · submitted 2026-05-21 · 💻 cs.CL · cs.LG

Learnability-Informed Fine-Tuning of Diffusion Language Models

Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords diffusion language models, supervised fine-tuning, learnability, reasoning benchmarks, post-training, masking schedule, AIME

The pith

Diffusion language models improve reasoning when fine-tuning matches token learnability to each masking level instead of applying uniform SFT.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard supervised fine-tuning hurts diffusion language models because it ignores how learnability changes with masking: rare tokens stay hard under heavy masking while common tokens become trivial under light masking. LIFT corrects this mismatch by training easy tokens at high masking and hard tokens once more context appears. This produces consistent gains over SFT baselines on reasoning tasks. A reader would care because diffusion models are an emerging alternative to autoregressive ones yet have lacked an effective post-training recipe until now.

Core claim

Vanilla SFT overlooks learnability in DLMs: rare tokens are difficult to learn when most of the input is masked, while common tokens are straightforward, and thus of little value, to learn when most of the input is unmasked. LIFT aligns the training schedule to these patterns by learning easy tokens when most of the input is masked and hard tokens when more context is available, matching the information available at different diffusion time steps. It outperforms existing SFT baselines across six reasoning benchmarks, with up to a 3x relative gain on AIME'24 and AIME'25.

What carries the argument

LIFT, a supervised fine-tuning algorithm that schedules token learning by difficulty to match diffusion time steps, training easy tokens under high masking and hard tokens under lower masking.
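
A minimal sketch of how such a schedule could pick training tokens, assuming a per-token model confidence and a masking rate t in [0, 1]. The two-stage shape (oversample candidates at rate t + ρ, then keep the highest-utility ones) follows the Figure 3 caption; the utility function, the keep count, and the names utility and lift_style_mask are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def utility(confidence, t):
    """Hypothetical utility: highest when confidence is close to the masking rate t.

    At high t (heavy masking) this favors confident, easy tokens; at low t
    (light masking) it favors low-confidence, hard tokens.
    """
    return 1.0 - abs(confidence - t)


def lift_style_mask(confidences, t, rho=0.1, rng=None):
    """Return sorted indices of tokens to mask and train on at masking rate t."""
    rng = rng or random.Random(0)
    n = len(confidences)
    # Stage 1: oversample candidate positions with rate t + rho (capped at 1).
    rate = min(1.0, t + rho)
    candidates = [i for i in range(n) if rng.random() < rate]
    # Stage 2 (assumed): keep about t * n candidates with the highest utility.
    target = min(len(candidates), math.ceil(t * n))
    ranked = sorted(candidates, key=lambda i: utility(confidences[i], t), reverse=True)
    return sorted(ranked[:target])


conf = [0.95, 0.05, 0.9, 0.1, 0.85, 0.15, 0.8, 0.2]
print(lift_style_mask(conf, t=0.75, rho=1.0))  # heavy masking
print(lift_style_mask(conf, t=0.25, rho=1.0))  # light masking
```

With rho=1.0 every position becomes a candidate, so the selection depends only on the utility ranking; smaller values of rho make stage 1 stochastic.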

If this is right

  • LIFT outperforms existing SFT baselines on six reasoning benchmarks.
  • Relative gains reach 3x on AIME'24 and AIME'25.
  • Training now respects the information present at each diffusion time step.
  • An efficient SFT-based post-training recipe becomes available for DLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mismatch between uniform fine-tuning and variable learnability may appear in other non-autoregressive sequence models.
  • LIFT's schedule could be combined with existing diffusion sampling tricks to further reduce inference cost.
  • If the pattern holds, future work could derive the optimal masking schedule directly from token frequency statistics without extra search.

Load-bearing premise

That the identified learnability patterns are the primary cause of SFT underperformance in DLMs and that explicitly aligning the schedule to them will deliver gains without introducing instability or reduced generalization.

What would settle it

A controlled ablation in which the same model is fine-tuned with a reversed or randomized difficulty schedule at each masking level and shows no gain, or a loss, on the same six reasoning benchmarks.
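
The three arms of such an ablation could be generated from one ranking function, as in the sketch below. It assumes a hypothetical learnability score (closeness of token confidence to the masking rate t); the score, the mode names, and rank_tokens are illustrative and do not come from the paper.

```python
import random


def rank_tokens(confidences, t, mode="lift", seed=0):
    """Order token indices from most to least preferred for training at rate t.

    mode="lift":     prefer tokens whose confidence is closest to t (assumed score).
    mode="reversed": prefer tokens whose confidence is farthest from t.
    mode="random":   ignore confidence entirely (control arm).
    """
    indices = list(range(len(confidences)))
    if mode == "random":
        random.Random(seed).shuffle(indices)
        return indices

    def score(i):
        return -abs(confidences[i] - t)

    if mode == "lift":
        return sorted(indices, key=score, reverse=True)
    if mode == "reversed":
        return sorted(indices, key=score)
    raise ValueError(f"unknown mode: {mode}")


conf = [0.9, 0.1, 0.5]
for mode in ("lift", "reversed", "random"):
    print(mode, rank_tokens(conf, t=0.8, mode=mode))
```

Holding the masking rate, data, and optimizer fixed while switching only the mode is what would isolate the effect of the ordering itself.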

Figures

Figures reproduced from arXiv: 2605.22939 by Atharv Chagi, Dileep Kalathil, Jacob Helwig, James Caverlee, Lakshmi Jotsna, Shubham Parashar, Shuiwang Ji, Sushil Vemuri.

Figure 1: Performance on AIME benchmarks. Pass@16 accuracy comparison on AIME’24 and AIME’25 for LLaDA-8B-Instruct, vanilla SFT, and LIFT. LIFT achieves substantial relative improvements over vanilla SFT on both challenging mathematical reasoning datasets, demonstrating the effectiveness of learnability-informed training.
Figure 2: Token Analysis with LLaDA. Using data collated from 4 post-training corpora (Muennighoff et al., 2025; Bercovich et al., 2025; Open-R1, 2025; Team OLMo et al., 2025), we analyze 0.5B masked tokens and aggregate token-level confidence and frequencies. (a) We bin tokens by log-scaled frequency and plot the mean model confidence against the average frequency. The marginalized plot (top) reveals that rare toke…
Figure 3: Learnability-Informed Fine-Tuning (LIFT). LIFT increases learnability by using model confidence and diffusion time to construct a learnability-informed mask so as to train on the highest utility tokens at each point in the diffusion process. Utility is estimated as a function of model confidence and diffusion time. In the first stage, a mask is sampled with rate t + ρ and used to estimate model confidences…
Figure 4: LIFT lies on the compute-efficient Pareto frontier, measured in H100 GPU hours. When applied to LLaDA, LIFT requires only 2 hours of training and already outperforms baselines on GSM8K and MATH. We also evaluate LIFT-A, an approximate variant of our method, which performs comparably at half the compute budget of LIFT. Finally, when LIFT is applied to LLaDA 1.5, which requires approximately 405 H100 hours o…
Figure 5: Accuracy vs. H100 hours (log scale) across Countdown and Sudoku. We show the Pareto frontier for Countdown and Sudoku in …
Figure 6: Dream Token Analysis. For each token, we compute Dream’s mean confidence when the token is the masked target and plot it against the token’s frequency in our collated post-training corpus. To reduce noise, tokens are grouped into shared log-spaced frequency bins (with a final tail bin for the most frequent tokens), and we plot the bin-wise average confidence versus the bin’s mean frequency. We show the mar…
Figure 7: Confidence Intervals for our experiments obtained via three runs with different seeds. The box plots illustrate the distribution of accuracy scores over multiple seeds for five experimental methods. The central horizontal lines represent the median, while the box and whiskers quantify the confidence intervals and performance range for (a) GSM8K, (b) Math500, (c) Countdown, and (d) Sudoku.
Figure 8: Confidence Intervals for AIME 2024 and 2025. In …
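
The frequency-versus-confidence analysis described in the captions of Figures 2 and 6 can be sketched as follows. This is a minimal illustration assuming the input is a stream of (token, confidence) pairs for masked target tokens; the function name, the bin count, and the exact binning rule are assumptions rather than the authors' code.

```python
import math
from collections import defaultdict


def bin_confidence_by_frequency(records, num_bins=4):
    """Group tokens into log-spaced frequency bins.

    records: iterable of (token, confidence) pairs, one per masked occurrence.
    Returns a list of (bin_index, mean_frequency, mean_confidence) tuples.
    """
    counts = defaultdict(int)
    conf_sum = defaultdict(float)
    for token, confidence in records:
        counts[token] += 1
        conf_sum[token] += confidence

    max_log = math.log10(max(counts.values()))
    bins = defaultdict(list)
    for token, count in counts.items():
        # Position of this token's frequency on a log scale, in [0, 1].
        frac = math.log10(count) / max_log if max_log > 0 else 0.0
        # The most frequent tokens are clamped into the final (tail) bin.
        b = min(num_bins - 1, int(frac * num_bins))
        bins[b].append((count, conf_sum[token] / count))

    summary = []
    for b in sorted(bins):
        freqs, confs = zip(*bins[b])
        summary.append((b, sum(freqs) / len(freqs), sum(confs) / len(confs)))
    return summary


records = [("the", 0.9)] * 100 + [("zeta", 0.2)]
for row in bin_confidence_by_frequency(records):
    print(row)
```

Plotting mean confidence against mean frequency per bin would then give a curve of the kind the captions describe.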

Original abstract

We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks learnability, namely what and when tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT-based post-training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25. Our code is publicly available at https://github.com/divelab/LIFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes challenges in applying supervised fine-tuning (SFT) to diffusion language models (DLMs), finding that vanilla SFT overlooks token learnability patterns (rare tokens are difficult under high masking; common tokens are low-value under low masking). It proposes LIFT, which aligns the training schedule to diffusion timesteps by learning easy tokens when most of the input is masked and hard tokens when more context is available. Experiments report that LIFT outperforms SFT baselines on six reasoning benchmarks, with up to 3x relative gains on AIME'24 and AIME'25; code is released publicly.

Significance. If the performance gains are robust and attributable to the learnability alignment, LIFT would offer a practical, efficient post-training method for improving reasoning in DLMs, addressing an understudied limitation of SFT in this architecture. The public code release supports reproducibility.

major comments (2)
  1. [§4] §4 (Experiments): The central claim of up to 3x gains on AIME'24/25 and consistent outperformance across six benchmarks requires explicit reporting of baseline implementations, number of random seeds, statistical tests, and ablation studies isolating the learnability schedule from other factors (e.g., learning rate schedules or masking ratios). Without these, it is unclear whether gains are due to LIFT or confounding variables.
  2. [§3] §3 (Method): The motivation that learnability patterns are the primary cause of SFT underperformance is plausible but load-bearing; the paper should include a controlled ablation comparing LIFT against a non-learnability-informed schedule that uses the same timestep-dependent masking but random token ordering to test causality.
minor comments (2)
  1. [§2] The abstract and introduction use 'learnability' without an explicit formal definition or metric; a short equation or pseudocode in §2 would clarify how learnability is quantified from the empirical analysis.
  2. Figure captions and axis labels in the experimental figures should explicitly state the diffusion timestep ranges corresponding to 'high masking' and 'low masking' to aid interpretation.
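
One way the requested statistical test could look is a paired bootstrap over benchmark problems, sketched below. It assumes per-problem 0/1 correctness vectors for two methods on the same problems; the function name and defaults are hypothetical, and the report does not say which test the authors will use.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for accuracy(A) - accuracy(B).

    correct_a, correct_b: equal-length lists of 0/1 outcomes on the same problems.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both methods must be scored on the same problems")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping the A/B pairing intact.
        sample = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in sample) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = sum(a - b for a, b in zip(correct_a, correct_b)) / n
    return point, lower, upper


# Toy outcomes for illustration only; these are not results from the paper.
method_a = [1, 1, 0, 1, 0, 1, 1, 0]
method_b = [1, 0, 0, 1, 0, 0, 1, 0]
print(paired_bootstrap_diff(method_a, method_b))
```

An interval that excludes zero would suggest the gap is not explained by problem sampling alone; variation across training seeds would still need separate runs.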

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested details and additional experiments.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The central claim of up to 3x gains on AIME'24/25 and consistent outperformance across six benchmarks requires explicit reporting of baseline implementations, number of random seeds, statistical tests, and ablation studies isolating the learnability schedule from other factors (e.g., learning rate schedules or masking ratios). Without these, it is unclear whether gains are due to LIFT or confounding variables.

    Authors: We agree that these details are necessary for rigorous evaluation. In the revised manuscript we will explicitly document baseline implementations (including how standard SFT was adapted to the diffusion setting), report the number of random seeds used, include statistical significance tests, and add ablation studies that isolate the learnability schedule from other variables such as learning-rate schedules and masking ratios.
    revision: yes

  2. Referee: [§3] §3 (Method): The motivation that learnability patterns are the primary cause of SFT underperformance is plausible but load-bearing; the paper should include a controlled ablation comparing LIFT against a non-learnability-informed schedule that uses the same timestep-dependent masking but random token ordering to test causality.

    Authors: We will add the requested controlled ablation. The revised paper will include results for a variant that applies the identical timestep-dependent masking schedule but replaces the learnability-informed token ordering with random ordering, thereby testing whether the performance gains are specifically attributable to alignment with learnability patterns.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper motivates LIFT from an empirical analysis of token learnability (rare tokens hard under high masking, common tokens low-value under low masking), then defines the training schedule to align with diffusion timesteps and evaluates gains on six external reasoning benchmarks. No equations, predictions, or uniqueness claims reduce to fitted inputs or self-citations by construction. The central result is an empirical improvement on independent test sets, with code stated to be public. The evaluation is therefore anchored in external benchmarks rather than in quantities the method itself defines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are described. The central claim rests on the empirical observation of learnability patterns and the effectiveness of the proposed scheduling.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1]

    Program Synthesis with Large Language Models

    Accessed 2026-01-21. Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981–17993, 2021a. Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Progr...

  2. [3]

    Llada2.0: Scaling up diffusion language models to 100b

    URL https://arxiv.org/abs/2505.00949. Bie, T., Cao, M., Chen, K., Du, L., Gong, M., Gong, Z., Gu, Y., Hu, J., Huang, Z., Lan, Z., et al. Llada2.0: Scaling up diffusion language models to 100b. arXiv preprint arXiv:2512.15745,

  3. [4]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

  4. [5]

    Self-evolving curriculum for llm reasoning

    Chen, X., Lu, J., Kim, M., Zhang, D., Tang, J., Piché, A., Gontier, N., Bengio, Y., and Kamalloo, E. Self-evolving curriculum for llm reasoning. arXiv preprint arXiv:2505.14970,

  5. [6]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  6. [7]

    Arel’s sudoku generator

    Cordero, A. Arel’s sudoku generator. https://www.ocf.berkeley.edu/~arel/sudoku/main.html. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human lang...

  7. [8]

    Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

    Kunde, V. T., Doudi, F., Farahbakhsh, M., Kalathil, D., Narayanan, K., and Chamberland, J.-F. Reinforcement learning for diffusion llms with entropy-guided step selection and stepwise advantages. arXiv preprint arXiv:2603.12554,

  8. [9]

    s1: Simple test-time scaling

    URL https://huggingface.co/datasets/math-ai/aime25. Accessed 2026-01-21. Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. B. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332,

  9. [10]

    The neglected tails in vision-language models

    URL https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts. Accessed 2026-01-21. Parashar, S., Lin, Z., Liu, T., Dong, X., Li, Y., Ramanan, D., Caverlee, J., and Kong, S. The neglected tails in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12988–12997,

  10. [11]

    Curriculum reinforcement learning from easy to hard tasks improves llm reasoning

    Parashar, S., Gui, S., Li, X., Ling, H., Vemuri, S., Olson, B., Li, E., Zhang, Y., Caverlee, J., Kalathil, D., et al. Curriculum reinforcement learning from easy to hard tasks improves llm reasoning. arXiv preprint arXiv:2506.06632,

  11. [12]

    Olmo 3

    Team OLMo et al. Olmo 3. arXiv preprint arXiv:2512.13961,

  12. [13]

    d2: Improved techniques for training reasoning diffusion language models

    Wang, G., Turok, G., Schiff, Y., Arriola, M., and Kuleshov, V. d2: Improved techniques for training reasoning diffusion language models. arXiv preprint arXiv:2509.21474,

  13. [14]

    GIFT: Guided Importance-Aware Fine-Tuning for Diffusion Language Models

    URL https://arxiv.org/abs/2509.20863. Xu, Z., Liu, Y., Yin, Y., Zhou, M., and Poovendran, R. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 6980–7008, Vienna, Austria,

  14. [15]

    Dream 7B: Diffusion Large Language Models

    Association for Computational Linguistics. URL https://aclanthology.org/2025.findings-acl.365/. Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487,

  15. [16]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

  16. [17]

    LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

    URL https://openreview.net/forum?id=7ZVRlBFuEv. Zhu, F., Wang, R., Nie, S., Zhang, X., Wu, C., Hu, J., Zhou, J., Chen, J., Lin, Y., Wen, J.-R., et al. Llada 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223,

  17. [18]

    Table 9. Word clouds of sampled tokens from s1K within each frequency bin, alongside the average LLaDA confidence computed over all tokens in that bin

    Again, we observe a clear frequency–confidence trend: high-frequency tokens are associated with higher average confidence, while rare tokens tend to receive lower confidence, consistent with the patterns in our aggregate plots. Table 9. Word clouds of sampled tokens from s1K within each frequency bin, alongside the average LLaDA confidence computed over all...

  18. [19]

    E. Additional Results on AIME’24 and AIME’25

    Additionally, to speed up the evaluation, we implement prefix-caching (Wu et al., 2025). E. Additional Results on AIME’24 and AIME’25. Table 12. Performance comparison on AIME’24 and AIME’25 under different avg@K and pass@K values. Column groups: AIME’24, AIME’25. Columns: Method, Avg8, Pass8, Avg16, Pass16, Avg8, Pass8, Avg16, Pa...

  19. [20]

    F. Additional Results on HumanEval and MBPP. We extend our evaluation to the domain of code generation, assessing model performance on MBPP (Austin et al., 2021b) and HumanEval (Chen et al., 2021). For this testing, models were first fine-tuned on the KodCode (Xu et al.,