Learnability-Informed Fine-Tuning of Diffusion Language Models
Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3
The pith
Diffusion language models improve reasoning when fine-tuning matches token learnability to each masking level instead of applying uniform SFT.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vanilla SFT overlooks learnability in DLMs: rare tokens are difficult to learn when most of the input is masked, while common tokens are straightforward, and thus of little value, to learn when most of the input is unmasked. LIFT aligns the training schedule to these patterns by learning easy tokens when most of the input is masked and hard tokens when more context is available, matching the information available at different diffusion time steps. The paper reports that LIFT outperforms existing SFT baselines across six reasoning benchmarks, with up to a 3x relative gain on AIME'24 and AIME'25.
What carries the argument
LIFT, a supervised fine-tuning algorithm that schedules token learning by difficulty to match diffusion time steps, training easy tokens under high masking and hard tokens under lower masking.
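The scheduling idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `select_supervised_tokens`, the per-position `learnability` scores (higher meaning easier), and the sliding rank window used between the two extremes are all assumptions made for illustration. The only behavior taken from the text is the direction of the schedule: easy tokens under heavy masking, hard tokens when more context is visible.

```python
def select_supervised_tokens(learnability, masked_positions, t, k):
    """Pick which masked positions contribute to the loss at masking level t.

    learnability: per-position scores, higher = easier (hypothetical proxy).
    masked_positions: indices of the currently masked tokens.
    t: masking level in (0, 1]; t near 1 means most of the input is masked.
    k: number of masked positions to supervise.
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1]")
    # Rank masked positions from hardest (lowest score) to easiest.
    ranked = sorted(masked_positions, key=lambda i: learnability[i])
    k = min(k, len(ranked))
    # Assumption: slide a window of width k along the ranking as t grows,
    # so small t selects the k hardest tokens and t = 1 the k easiest.
    start = round(t * (len(ranked) - k))
    return sorted(ranked[start:start + k])


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2]  # illustrative values only
    masked = list(range(len(scores)))
    for level in (0.05, 0.5, 1.0):
        print(level, select_supervised_tokens(scores, masked, level, k=2))
```

A reversed control of the kind discussed later on this page could be obtained from the same sketch by passing `1 - t` (clipped to stay positive) in place of `t`.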
If this is right
- LIFT outperforms existing SFT baselines on six reasoning benchmarks.
- Relative gains reach 3x on AIME'24 and AIME'25.
- Training now respects the information present at each diffusion time step.
- An efficient SFT-based post-training recipe becomes available for DLMs.
Where Pith is reading between the lines
- The same mismatch between uniform fine-tuning and variable learnability may appear in other non-autoregressive sequence models.
- LIFT's schedule could be combined with existing diffusion sampling tricks to further reduce inference cost.
- If the pattern holds, future work could derive the optimal masking schedule directly from token frequency statistics without extra search.
Load-bearing premise
The identified learnability patterns are the primary cause of SFT underperformance in DLMs, and explicitly aligning the schedule to them delivers gains without introducing instability or reducing generalization.
What would settle it
A controlled ablation in which the same model is fine-tuned with a reversed or randomized difficulty schedule at each masking level and shows no gain, or a loss, on the same six reasoning benchmarks.
Original abstract
We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks learnability, namely what and when tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT-based post-training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25. Our code is publicly available at https://github.com/divelab/LIFT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes challenges in applying supervised fine-tuning (SFT) to diffusion language models (DLMs), finding that vanilla SFT overlooks token learnability patterns (rare tokens are difficult to learn under high masking; common tokens are low-value under low masking). It proposes LIFT, which aligns the training schedule to diffusion timesteps by learning easy tokens when most of the input is masked and hard tokens when more context is available. Experiments report that LIFT outperforms SFT baselines on six reasoning benchmarks, with up to 3x relative gains on AIME'24 and AIME'25; code is released publicly.
Significance. If the performance gains are robust and attributable to the learnability alignment, LIFT would offer a practical, efficient post-training method for improving reasoning in DLMs, addressing an understudied limitation of SFT in this architecture. The public code release supports reproducibility.
major comments (2)
- [§4] §4 (Experiments): The central claim of up to 3x gains on AIME'24/25 and consistent outperformance across six benchmarks requires explicit reporting of baseline implementations, number of random seeds, statistical tests, and ablation studies isolating the learnability schedule from other factors (e.g., learning rate schedules or masking ratios). Without these, it is unclear whether gains are due to LIFT or confounding variables.
- [§3] §3 (Method): The motivation that learnability patterns are the primary cause of SFT underperformance is plausible but load-bearing; the paper should include a controlled ablation comparing LIFT against a non-learnability-informed schedule that uses the same timestep-dependent masking but random token ordering to test causality.
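As an illustration of the kind of statistical test the first major comment asks for, the sketch below runs a paired sign-flip permutation test on per-seed scores. It assumes both methods were evaluated with matched seeds; the function name `paired_permutation_pvalue` and its parameters are hypothetical, and nothing here reflects how the paper actually computes significance.

```python
import random
from statistics import mean


def paired_permutation_pvalue(method_scores, baseline_scores,
                              n_resamples=10000, seed=0):
    """Two-sided p-value for the mean paired difference across seeds."""
    if len(method_scores) != len(baseline_scores) or not method_scores:
        raise ValueError("need two non-empty, equally long score lists")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference
        # is arbitrary, so flip each sign with probability 1/2.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

With only a handful of seeds the smallest attainable p-value is bounded below by roughly 2 / 2^n for n seeds, which is one reason the number of seeds needs to be reported alongside any significance claim.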
minor comments (2)
- [§2] The abstract and introduction use 'learnability' without an explicit formal definition or metric; a short equation or pseudocode in §2 would clarify how learnability is quantified from the empirical analysis.
- Figure captions and axis labels in the experimental figures should explicitly state the diffusion timestep ranges corresponding to 'high masking' and 'low masking' to aid interpretation.
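To make the first minor comment concrete, here is one possible way to quantify learnability empirically, sketched in Python. It is an assumption, not the paper's definition: tokens are grouped into equal-width log-frequency bins and a model's confidence is averaged within each bin. The function name `confidence_by_frequency_bin`, the add-one smoothing for unseen tokens, and the binning rule are all illustrative choices.

```python
import math
from collections import Counter, defaultdict


def confidence_by_frequency_bin(tokens, confidences, corpus_tokens, n_bins=3):
    """Return {bin index: mean confidence}; bin 0 holds the rarest tokens."""
    if len(tokens) != len(confidences) or not tokens:
        raise ValueError("tokens and confidences must be non-empty and aligned")
    counts = Counter(corpus_tokens)
    # Add one so tokens absent from the reference corpus get log(1) = 0.
    log_freqs = [math.log(counts[tok] + 1) for tok in tokens]
    lo, hi = min(log_freqs), max(log_freqs)
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range
    sums, sizes = defaultdict(float), defaultdict(int)
    for log_freq, conf in zip(log_freqs, confidences):
        # Clamp so the maximum value falls in the last bin.
        b = min(int((log_freq - lo) / width), n_bins - 1)
        sums[b] += conf
        sizes[b] += 1
    return {b: sums[b] / sizes[b] for b in sorted(sizes)}
```

Under this proxy, a model whose average confidence rises with the bin index would show the kind of frequency dependence the report describes; the per-token confidences themselves would have to come from the DLM being studied.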
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested details and additional experiments.
Point-by-point responses
-
Referee: [§4] §4 (Experiments): The central claim of up to 3x gains on AIME'24/25 and consistent outperformance across six benchmarks requires explicit reporting of baseline implementations, number of random seeds, statistical tests, and ablation studies isolating the learnability schedule from other factors (e.g., learning rate schedules or masking ratios). Without these, it is unclear whether gains are due to LIFT or confounding variables.
Authors: We agree that these details are necessary for rigorous evaluation. In the revised manuscript we will explicitly document baseline implementations (including how standard SFT was adapted to the diffusion setting), report the number of random seeds used, include statistical significance tests, and add ablation studies that isolate the learnability schedule from other variables such as learning-rate schedules and masking ratios. revision: yes
-
Referee: [§3] §3 (Method): The motivation that learnability patterns are the primary cause of SFT underperformance is plausible but load-bearing; the paper should include a controlled ablation comparing LIFT against a non-learnability-informed schedule that uses the same timestep-dependent masking but random token ordering to test causality.
Authors: We will add the requested controlled ablation. The revised paper will include results for a variant that applies the identical timestep-dependent masking schedule but replaces the learnability-informed token ordering with random ordering, thereby testing whether the performance gains are specifically attributable to alignment with learnability patterns. revision: yes
Circularity Check
No significant circularity
full rationale
The paper motivates LIFT from an empirical analysis of token learnability (rare tokens hard under high masking, common tokens low-value under low masking), then defines the training schedule to align with diffusion timesteps and evaluates gains on six external reasoning benchmarks. No equations, predictions, or uniqueness claims reduce to fitted inputs or self-citations by construction. The central result is an empirical improvement on independent test sets, with code stated to be public. This is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked... LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available
-
IndisputableMonolith/Foundation/DimensionForcing.lean: alexander_duality_circle_linking (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
S_t = Bottom-K ... if t ∈ (0, 1/H); ... Top-K if t ∈ [1 − 1/H, 1]
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Program Synthesis with Large Language Models
Accessed 2026-01-21. Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981–17993, 2021a. Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Progr...
-
[3]
Bie, T., Cao, M., Chen, K., Du, L., Gong, M., Gong, Z., Gu, Y., Hu, J., Huang, Z., Lan, Z., et al
URL https://arxiv.org/abs/2505.00949. Bie, T., Cao, M., Chen, K., Du, L., Gong, M., Gong, Z., Gu, Y., Hu, J., Huang, Z., Lan, Z., et al. Llada2.0: Scaling up diffusion language models to 100b. arXiv preprint arXiv:2512.15745,
-
[4]
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,
-
[5]
Self-evolving curriculum for llm reasoning. arXiv preprint arXiv:2505.14970,
Chen, X., Lu, J., Kim, M., Zhang, D., Tang, J., Piché, A., Gontier, N., Bengio, Y., and Kamalloo, E. Self-evolving curriculum for llm reasoning. arXiv preprint arXiv:2505.14970,
-
[6]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,
-
[7]
Cordero, A. Arel’s sudoku generator. https://www.ocf.berkeley.edu/~arel/sudoku/main.html. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human lang...
-
[8]
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Kunde, V. T., Doudi, F., Farahbakhsh, M., Kalathil, D., Narayanan, K., and Chamberland, J.-F. Reinforcement learning for diffusion llms with entropy-guided step selection and stepwise advantages. arXiv preprint arXiv:2603.12554,
-
[9]
URL https://huggingface.co/datasets/math-ai/aime25. Accessed 2026-01-21. Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. B. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332,
-
[10]
URL https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts. Accessed 2026-01-21. Parashar, S., Lin, Z., Liu, T., Dong, X., Li, Y., Ramanan, D., Caverlee, J., and Kong, S. The neglected tails in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12988–12997,
-
[11]
Parashar, S., Gui, S., Li, X., Ling, H., Vemuri, S., Olson, B., Li, E., Zhang, Y., Caverlee, J., Kalathil, D., et al. Curriculum reinforcement learning from easy to hard tasks improves llm reasoning. arXiv preprint arXiv:2506.06632,
-
[12]
Team OLMo et al. Olmo 3. arXiv preprint arXiv:2512.13961,
-
[13]
Wang, G., Turok, G., Schiff, Y., Arriola, M., and Kuleshov, V. d2: Improved techniques for training reasoning diffusion language models. arXiv preprint arXiv:2509.21474,
-
[14]
GIFT: Guided Importance-Aware Fine-Tuning for Diffusion Language Models
URL https://arxiv.org/abs/2509.20863. Xu, Z., Liu, Y., Yin, Y., Zhou, M., and Poovendran, R. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 6980–7008, Vienna, Austria,
-
[15]
Dream 7B: Diffusion Large Language Models
Association for Computational Linguistics. URL https://aclanthology.org/2025.findings-acl.365/. Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487,
-
[16]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,
-
[17]
LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
URL https://openreview.net/forum?id=7ZVRlBFuEv. Zhu, F., Wang, R., Nie, S., Zhang, X., Wu, C., Hu, J., Zhou, J., Chen, J., Lin, Y., Wen, J.-R., et al. Llada 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223,
-
[18]
Again, we observe a clear frequency–confidence trend: high-frequency tokens are associated with higher average confidence, while rare tokens tend to receive lower confidence, consistent with the patterns in our aggregate plots. Table 9. Word clouds of sampled tokens from s1K within each frequency bin, alongside the average LLaDA confidence computed overall...
-
[19]
Additionally, to speed up the evaluation, we implement prefix-caching (Wu et al., 2025). E. Additional Results on AIME’24 and AIME’25. Table 12. Performance comparison on AIME’24 and AIME’25 under different avg@K and pass@K values. AIME’24 AIME’25 Method Avg8 Pass8 Avg16 Pass16 Avg8 Pass8 Avg16 Pa...
-
[20]
F. Additional Results on HumanEval and MBPP We extend our evaluation to the domain of code generation, assessing model performance on MBPP (Austin et al., 2021b) and HumanEval (Chen et al., 2021). For this testing, models were first fine-tuned on the KodCode (Xu et al.,