Diffusion-Inspired Masked Fine-Tuning for Knowledge Injection in Autoregressive LLMs
Pith reviewed 2026-05-18 07:18 UTC · model grok-4.3
The pith
Masked fine-tuning lets autoregressive LLMs absorb new facts without paraphrases and without reversal-curse failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Autoregressive LLMs need paraphrase augmentation to turn raw knowledge statements into question-answering capability, whereas diffusion LLMs achieve high accuracy from the statements alone. Introducing a masked fine-tuning procedure—where the autoregressive model reconstructs the original text from a masked version supplied in context—removes the need for paraphrases, confers resistance to the reversal curse, and closes the performance gap with diffusion models. The same objective also yields the highest accuracy on GPQA-diamond when trained on a 1.2-million-sample knowledge dataset and improves results on mathematical tasks.
What carries the argument
The masked fine-tuning objective, which trains the model to recover the complete original text when a masked version appears in the prompt, supplying a bidirectional-style signal inside an otherwise left-to-right architecture.
If this is right
- arLLMs generalize from single factual statements to QA without generating paraphrase sets.
- Knowledge updates become resistant to reversal questions that previously broke the model.
- On a 1.2-million-sample knowledge corpus, masked SFT records the highest accuracy among all tested fine-tuning variants on GPQA-diamond.
- The same objective lifts performance on math reasoning benchmarks beyond pure factual injection.
Where Pith is reading between the lines
- The method may support more efficient continual learning pipelines when facts arrive incrementally.
- Varying the mask ratio during fine-tuning could be tested to measure its direct effect on reversal-curse resistance.
- Hybrid pre-training that mixes autoregressive and masked objectives from the beginning might amplify the benefits seen here.
Load-bearing premise
The observed gains come solely from the demasking objective and not from uncontrolled differences in model scale, data distribution, or masking details.
What would settle it
An experiment that applies the identical masked fine-tuning procedure yet still requires paraphrases or shows reversal-curse failures on the same knowledge statements would disprove the central claim.
Figures
read the original abstract
Large language models (LLMs) are often used in environments where facts evolve, yet factual knowledge updates via fine-tuning on unstructured text often suffer from 1) reliance on compute-heavy paraphrasing augmentation and 2) the reversal curse. Recent studies show diffusion large language models (dLLMs) require fewer training samples to achieve lower loss in pre-training and are more resistant to the reversal curse, suggesting dLLMs may learn new knowledge more easily than autoregressive LLMs (arLLMs). We test this hypothesis in controlled knowledge fine-tuning experiments and find that while arLLMs rely on paraphrase augmentation to generalize knowledge text into question-answering (QA) capability, dLLMs do not require paraphrases to achieve high QA accuracy. To further investigate whether the demasking objective alone can induce such a knowledge injection advantage in dLLMs regardless of their diffusion denoising paradigm, we propose masked fine-tuning for arLLMs, which prompts an arLLM to reconstruct the original text given a masked version in context. The masked fine-tuning for arLLMs substantially improves the efficacy of knowledge injection, i.e. no paraphrase needed and resistant to the reversal curse, closing the gap between arLLMs and dLLMs. We also demonstrate broader applicability: on a large-scale knowledge-intensive dataset (1.2M samples), masked SFT achieves the best downstream accuracy on GPQA-diamond among all fine-tuning variants. The demasking objective also improves SFT on math tasks, suggesting broad utility beyond factual knowledge injection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a masked fine-tuning objective applied to autoregressive LLMs (arLLMs), inspired by diffusion LLMs (dLLMs), enables effective knowledge injection without paraphrase augmentation and with resistance to the reversal curse. This closes the gap between arLLMs and dLLMs on QA tasks. The approach is further validated on a 1.2M-sample knowledge dataset where masked SFT achieves top accuracy on GPQA-diamond, and it also improves SFT on math tasks.
Significance. If the central empirical findings hold under tighter controls, the work would offer a practical, low-augmentation method for updating factual knowledge in standard arLLMs while mitigating the reversal curse. Demonstrating competitive performance with dLLMs via a simple demasking objective, plus gains on large-scale and math benchmarks, would be useful for maintaining current knowledge in deployed models.
major comments (3)
- [§4] §4 (Knowledge Injection Experiments): The central claim that the demasking objective alone produces the observed QA accuracy gains and reversal-curse resistance requires explicit confirmation that model scale, pre-training corpus, and the precise masking/noising procedure (including context provision and prediction directionality) are matched between the masked arLLM fine-tuning and the dLLM baseline. Any mismatch in these factors could attribute gains to input format rather than the objective.
- [Table 2] Table 2 / Figure 3 (QA and reversal results): The reported accuracy improvements lack error bars, multiple random seeds, or statistical tests. Without these, it is difficult to determine whether the gap closure between masked arLLMs and dLLMs is robust or sensitive to implementation details such as exact masking ratio.
- [§3.2] §3.2 (Masked Fine-Tuning Prompt Design): The manuscript should specify whether the 'masked version in context' prompt introduces bidirectional signals or different masking ratios not present in the original dLLM training. If the prompt format differs materially, the no-paraphrase and reversal-resistance benefits may not isolate the demasking objective as claimed.
minor comments (2)
- [Abstract] Abstract: Exact masking ratios and full ablation tables are referenced but not shown; adding a compact ablation summary would improve verifiability.
- [§3] Notation: The distinction between 'masked fine-tuning' and standard SFT should be formalized with a short equation or pseudocode to avoid ambiguity in later sections.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below with clarifications and commit to revisions that strengthen the empirical controls and reporting.
read point-by-point responses
-
Referee: [§4] §4 (Knowledge Injection Experiments): The central claim that the demasking objective alone produces the observed QA accuracy gains and reversal-curse resistance requires explicit confirmation that model scale, pre-training corpus, and the precise masking/noising procedure (including context provision and prediction directionality) are matched between the masked arLLM fine-tuning and the dLLM baseline. Any mismatch in these factors could attribute gains to input format rather than the objective.
Authors: We appreciate the need for explicit matching details. Experiments used models of matched scale (e.g., 7B-parameter variants from the same families) and standard pre-training corpora for each architecture. The masking procedure for arLLMs replicates dLLM noising with identical ratios and context provision; the sole controlled difference is the training objective (causal next-token prediction on masked spans vs. diffusion denoising). We will add a dedicated comparison table in §4 listing scale, corpus, masking ratio, context format, and directionality to make the controls fully transparent. revision: yes
-
Referee: [Table 2] Table 2 / Figure 3 (QA and reversal results): The reported accuracy improvements lack error bars, multiple random seeds, or statistical tests. Without these, it is difficult to determine whether the gap closure between masked arLLMs and dLLMs is robust or sensitive to implementation details such as exact masking ratio.
Authors: We agree that variability reporting is essential. The revised manuscript will include results averaged over five random seeds, with error bars on Table 2 and Figure 3, plus statistical significance tests (paired t-tests) against baselines. We are rerunning the relevant experiments to obtain these statistics. revision: yes
-
Referee: [§3.2] §3.2 (Masked Fine-Tuning Prompt Design): The manuscript should specify whether the 'masked version in context' prompt introduces bidirectional signals or different masking ratios not present in the original dLLM training. If the prompt format differs materially, the no-paraphrase and reversal-resistance benefits may not isolate the demasking objective as claimed.
Authors: The prompt supplies masked text as context and requires left-to-right autoregressive reconstruction of the original tokens; no bidirectional attention is added because the underlying model remains strictly causal. The masking ratio is fixed at 15 percent to match typical dLLM noising schedules. We will expand §3.2 with the exact prompt template, the chosen ratio, and an explicit statement that the autoregressive constraint is preserved, thereby isolating the demasking objective. revision: yes
Circularity Check
No circularity: empirical results from direct comparisons
full rationale
The paper advances an empirical claim that masked fine-tuning improves knowledge injection in arLLMs, supported by QA accuracy measurements and reversal-curse tests on held-out data. These outcomes are obtained via controlled experiments rather than any derivation, equation, or parameter fit that reduces the reported metrics to the inputs by construction. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the method or results sections; the demasking objective is implemented and evaluated independently of the dLLM baselines.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Recent studies show diffusion LLMs require fewer samples for lower loss and resist the reversal curse better than autoregressive LLMs.
Forward citations
Cited by 1 Pith paper
-
The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse
Bidirectional objectives mitigate reversal by requiring explicit source-as-target signals and storing directions as distinct representations instead of inducing latent generalization.
Reference graph
Works this paper leans on
-
[1]
URLhttps: //openreview.net/forum?id=oDbiL9CLoS. Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Kor- bak, and Owain Evans. The reversal curse: Llms trained on" a is b" fail to learn" b is a".arXiv preprint arXiv:2309.12288,
-
[2]
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
URL http://arxiv.org/abs/1810.04805. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short p...
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[5]
URLhttps://zenodo.org/records/12608602. Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning llms on new knowledge encourage hallucinations? InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7765–7784,
-
[6]
Reverse training to nurse the reversal curse, 2024a
Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, and Sainbayar Sukhbaatar. Reverse training to nurse the reversal curse, 2024a. URLhttps://arxiv.org/abs/2403.13799. Olga Golovneva, Zeyuan Allen-Zhu, Jason E Weston, and Sainbayar Sukhbaatar. Reverse training to nurse the reversal curse. InFirst Conference on Language Modeling, 2024b. URLhttps: //openreview....
-
[7]
doi: 10.18653/v1/2024.findings-acl.680
Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.680. URLhttps://aclanthology.org/ 2024.findings-acl.680/. Thomas Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. Aging with grace: Lifelong model editing with discrete key-value adaptors,
-
[8]
URLhttps: //arxiv.org/abs/2211.11031. Houcheng Jiang, Junfeng Fang, Ningyu Zhang, Guojun Ma, Mingyang Wan, Xiang Wang, Xiangnan He, and Tat-seng Chua. Anyedit: Edit any knowledge encoded in language models.arXiv preprint arXiv:2502.05628,
-
[9]
Scaling Laws for Neural Language Models
URLhttps://arxiv.org/abs/2001.08361. Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham M. Kakade, and Sitan Chen. Train for the worst, plan for the best: Understanding token ordering in masked diffusions. InForty-second Interna- tional Conference on Machine Learning,
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[10]
Math-Verify: Math Verification Library
Hynek Kydlí ˇcek. Math-Verify: Math Verification Library. URLhttps://github.com/ huggingface/math-verify. Andrew K Lampinen, Arslan Chaudhry, Stephanie CY Chan, Cody Wild, Diane Wan, Alex Ku, Jörg Bornschein, Razvan Pascanu, Murray Shanahan, and James L McClelland. On the generalization of language models from in-context learning and finetuning: a control...
-
[11]
Lost in the Middle: How Language Models Use Long Contexts
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.arXiv preprint arXiv:2307.03172,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Rethinking the reversal curse of llms: a prescription from human knowl- edge reversal
Zhicong Lu, Li Jin, Peiguang Li, Yu Tian, Linhao Zhang, Sirui Wang, Guangluan Xu, Changyuan Tian, and Xunliang Cai. Rethinking the reversal curse of llms: a prescription from human knowl- edge reversal. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7518–7530,
work page 2024
-
[13]
An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2023.URL https://arxiv. org/abs/2308.08747, 2308:60,
work page internal anchor Pith review arXiv 2023
-
[14]
An anal- ysis and mitigation of the reversal curse
Ang Lv, Kaiyi Zhang, Shufang Xie, Quan Tu, Yuhan Chen, Ji-Rong Wen, and Rui Yan. An anal- ysis and mitigation of the reversal curse. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 13603–13615, Miami, Florida, USA, November
work page 2024
-
[15]
doi: 10.18653/v1/2024.emnlp-main.754
Association for Computa- tional Linguistics. doi: 10.18653/v1/2024.emnlp-main.754. URLhttps://aclanthology. org/2024.emnlp-main.754/. Nick Mecklenburg, Yiyou Lin, Xiaoxiao Li, Daniel Holstein, Leonardo Nunes, Sara Malvar, Bruno Silva, Ranveer Chandra, Vijay Aski, Pavan Kumar Reddy Yannam, et al. Injecting new knowledge into large language models via super...
-
[16]
Large Language Diffusion Models
Notion Blog. Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.),International Conference on Representation Learning, volume 2025, pp. 82974–82997, 2025a. URLhttps://proceedings.iclr.cc/paper_files/paper/2025/file/...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Memorization and knowledge injec- tion in gated llms.arXiv preprint arXiv:2504.21239,
Xu Pan, Ely Hahami, Zechen Zhang, and Haim Sompolinsky. Memorization and knowledge injec- tion in gated llms.arXiv preprint arXiv:2504.21239,
-
[18]
Diffu- sion beats autoregressive in data-constrained settings.arXiv preprint arXiv:2507.15857,
Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, and Deepak Pathak. Diffu- sion beats autoregressive in data-constrained settings.arXiv preprint arXiv:2507.15857,
-
[19]
URLhttps://pytorch.org/ docs/stable/profiler.html. Accessed: 2025-11-18. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67,
work page 2025
-
[20]
Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. Analyzing and reducing catastrophic forgetting in parameter efficient tuning.arXiv preprint arXiv:2402.18865,
-
[21]
Heydar Soudani, Evangelos Kanoulas, and Faegheh Hasibi. Fine tuning vs. retrieval augmented generation for less popular knowledge. InProceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pp. 12–22,
work page 2024
-
[22]
URLhttps://arxiv.org/abs/2405.14768. Xiao Wang, Yuansen Zhang, Tianze Chen, Songyang Gao, Senjie Jin, Xianjun Yang, Zhiheng Xi, Rui Zheng, Yicheng Zou, Tao Gui, et al. Trace: A comprehensive benchmark for continual learning in large language models.arXiv preprint arXiv:2310.06762,
-
[23]
On the theoretical limitations of embedding-based retrieval
Orion Weller, Michael Boratko, Iftekhar Naim, and Jinhyuk Lee. On the theoretical limitations of embedding-based retrieval.arXiv preprint arXiv:2508.21038,
-
[24]
Shuchen Xue, Tianyu Xie, Tianyang Hu, Zijin Feng, Jiacheng Sun, Kenji Kawaguchi, Zhenguo Li, and Zhi-Ming Ma. Any-order gpt as masked diffusion model: Decoupling formulation and architecture.arXiv preprint arXiv:2506.19935,
-
[25]
13 Preprint An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Dream 7B: Diffusion Large Language Models
Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma
URLhttps://arxiv.org/abs/2312.11795. Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the catastrophic forgetting in multimodal large language models. InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following,
-
[28]
Eric Zhao, Pranjal Awasthi, and Nika Haghtalab. From style to facts: Mapping the boundaries of knowledge injection with finetuning.arXiv preprint arXiv:2503.05919,
-
[29]
Datasets and experimental setups
A APPENDIX A.1 DATASET AND CODE AVAILABILITY The dataset and code base are available at:https://github.com/xup5/masked_arLLM. git A.2 LLMUSAGE The usage of LLM is limited to language polishing and literature search. We asked an LLM to suggest surface-level rewrites to improve clarity, grammar, and style for author-written passages. Edits were limited to p...
work page 2025
-
[30]
is an American right−handed sabre fencer. He represented the United States at the 2024 Summer Olympics in Paris, France, in the men's sabre and men's team sabre events in July
work page 2024
-
[31]
Question 1: Which weapon category does Mitchell Saron compete in, representing the United States at the 2024 Summer Olympics? Answer 1: Sabre Cue used in the question: [Mitchell Saron, United States, 2024 Summer Olympics] Question 2 (reverse question of question 1): Who represented the United States at the 2024 Summer Olympics to compete in the men's sabr...
work page 2024
-
[32]
He was brought into the world in Elk Grove, CA. He culminated his studies at Kansas State University. He concentrated his efforts toward EMT and Paramedic. He supported the operations at HP. He practiced his profession in Palo Alto, CA." Forward question:"What is the birth date of Curtis Chase Emley?" Answer:"May 28, 1952" Backward question:"Give me the f...
work page 1952
-
[33]
It started as a gathering spot for members of the Nation of Islam in the 1970s but trans- formed into a multicultural Islamic venue in subsequent decades." Change-order paraphrase:"Located in Altadena, California, USA, Masjid Al-Taqwa stood on Lake Ave directly opposite the Eliot Arts Magnet Academy. Originally established in the 1970s as a historical Afr...
work page 2025
-
[34]
When evaluating the resulting models, we use the evaluation frameworkLM Evaluation Harnessand the default tasksgsm8kand hendrycks_math(Gao et al. (2024)). Specifically, we choose to use 0-shot and pass@1 with a maximum generation length of 256 at a temperature of
work page 2024
-
[35]
For GSM8K we report the accuracy using exact match withLM Evaluation Harness’s flexible extraction. For MATH we report the accuracy using exact match withmath-verifyextraction (Kydlí ˇcek). Both extraction methods are chosen to maximize alignment with human examination. Most of the training configurations are the same as those in the main experiments, exc...
work page 2024
-
[36]
was born on March 25, 2024?" Ground truth answer:"Owen Carole Pavy" arLLM without paraphrases (step 1280, best checkpoint):"/n /n I am a licensed Dental Assistant in Phoenix, AZ. I was hired by a company that provides dental services. My birth is commemorated in March 25, 2024, because that was the year in which I completed my studies at a dental school i...
work page 2024
-
[37]
Her early moments were in Seattle, WA. She secured her degree from University of Minnesota. She set her academic sights on Dental Assistant. She was in charge of tasks at Avnet. She worked at a business located in Phoenix, AZ." Wiki dataset Original text:"The 2024-25 Men’s FIH Hockey Nations Cup was the third edition of the Men’s FIH Hockey Nations Cup, t...
work page 2024
-
[38]
The defending champions New Zealand won their second title by defeating Pakistan 6-2 in the final. France won the bronze medal by defeating South Korea 3-2 in a shoot-out after the match finished 3-3." Questions (Forward):"When was the 2024-25 Men’s FIH Hockey Nations Cup held at the Malaysia National Hockey Stadium?" Ground truth answer:"15 to 21 June 20...
work page 2024
-
[39]
arLLM with same-order paraphrases (step 10240, best checkpoint):
The tournament was organized by the Malaysian Hockey Federation with the guidance of the Global Executive Committee of the Men’s FIH (The Men’s FIH is the world governing body for men’s field hockey and the umbrella organization for the Men’s FIH Nations Cup)." arLLM with same-order paraphrases (step 10240, best checkpoint): "/n /n The 2025 Men’s FIH Hock...
work page 2025
-
[40]
dLLM without paraphrases (step 80):
The event marked the third edition of the Men’s FIH Hockey Nations Cup, the annual qualification tournament for the Men’s FIH Pro League organised by the International Hockey Federation. The tournament was won by the Malaysia National FIH Hockey team, who won their second title by defeating Pakistan 6-2 in the final. France won the bronze medal by defeati...
work page 2024
-
[41]
dLLM with same-order paraphrases (step 5120, best checkpoint):
The match was played between Malaysia and the United Arab Emirates. Malaysia won the match 3-2." dLLM with same-order paraphrases (step 5120, best checkpoint): "/n /n The 2024-25 Men’s FIH Hockey Nations Cup was the third edition of the Men’s FIH Hockey Nations Cup, the annual qualification tournament for the Men’s FIH Pro League organised by the Internat...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.