arxiv: 2510.09885 · v5 · submitted 2025-10-10 · 💻 cs.CL · cs.AI

Diffusion-Inspired Masked Fine-Tuning for Knowledge Injection in Autoregressive LLMs

Xu Pan , Ely Hahami , Jingxuan Fan , Ziqian Xie , Haim Sompolinsky This is my paper

Pith reviewed 2026-05-18 07:18 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords knowledge injectionmasked fine-tuningautoregressive LLMsdiffusion LLMsreversal cursefactual updatesGPQA

0 comments p. Extension

The pith

Masked fine-tuning lets autoregressive LLMs absorb new facts without paraphrases and without reversal-curse failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a masked reconstruction task during fine-tuning can give autoregressive LLMs the same knowledge-injection strengths that diffusion LLMs already show. Standard fine-tuning on plain text requires heavy paraphrase augmentation to produce usable QA behavior and still collapses on reversed questions. If the masked objective alone produces the advantage, then updating models with fresh facts becomes far simpler and more reliable as real-world information changes.

Core claim

Autoregressive LLMs need paraphrase augmentation to turn raw knowledge statements into question-answering capability, whereas diffusion LLMs achieve high accuracy from the statements alone. Introducing a masked fine-tuning procedure—where the autoregressive model reconstructs the original text from a masked version supplied in context—removes the need for paraphrases, confers resistance to the reversal curse, and closes the performance gap with diffusion models. The same objective also yields the highest accuracy on GPQA-diamond when trained on a 1.2-million-sample knowledge dataset and improves results on mathematical tasks.

What carries the argument

The masked fine-tuning objective, which trains the model to recover the complete original text when a masked version appears in the prompt, supplying a bidirectional-style signal inside an otherwise left-to-right architecture.

If this is right

arLLMs generalize from single factual statements to QA without generating paraphrase sets.
Knowledge updates become resistant to reversal questions that previously broke the model.
On a 1.2-million-sample knowledge corpus, masked SFT records the highest accuracy among all tested fine-tuning variants on GPQA-diamond.
The same objective lifts performance on math reasoning benchmarks beyond pure factual injection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may support more efficient continual learning pipelines when facts arrive incrementally.
Varying the mask ratio during fine-tuning could be tested to measure its direct effect on reversal-curse resistance.
Hybrid pre-training that mixes autoregressive and masked objectives from the beginning might amplify the benefits seen here.

Load-bearing premise

The observed gains come solely from the demasking objective and not from uncontrolled differences in model scale, data distribution, or masking details.

What would settle it

An experiment that applies the identical masked fine-tuning procedure yet still requires paraphrases or shows reversal-curse failures on the same knowledge statements would disprove the central claim.

Figures

Figures reproduced from arXiv: 2510.09885 by Ely Hahami, Haim Sompolinsky, Jingxuan Fan, Xu Pan, Ziqian Xie.

**Figure 2.** Figure 2: Training dynamics of arLLM (Llama 8B), dLLM (Llada), and masked arLLM (Llama [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: An example of masked fine-tuning prompt. Random selection of text tokens are replaced [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Accuracy of using fixed mask ratio (t) in dLLM fine-tuning and arLLM masked finetuning on the NameDescription dataset. Previous studies (Allen-Zhu & Li, 2024; 2025) claim that bidirectional BERT-like models struggle with even forward style knowledge extraction due to the mask loss, which causes the model to learn incorrect associations between tokens. A key modification that makes a BERT-like model a pro… view at source ↗

**Figure 5.** Figure 5: Learning rate sweep of Llama-3.1-8B-instruct. We swept learning rate on the [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: Total accuracy (macro average of forward and backward accuracy) of experiments on [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Learning dynamics of Llama-3.2-3B-Instruct. [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

**Figure 8.** Figure 8: Learning dynamics of Qwen/Qwen3-4B-Instruct-2507. [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

**Figure 9.** Figure 9: Learning dynamics of Llama-3.2-3B-Instruct. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Random seed effects in Llada. Random seed determines the sampling of mask ratio and [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: Random seed effects in maksed Llama3.1 8B. Random seed determines the sampling [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: To verify the advantage of masked fine-tuning of arLLMs is not simply due “data aug [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

**Figure 13.** Figure 13: Learning rate and epoch sweep of Llama-3.2-3B-Instruct on GSM8K dataset. [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗

**Figure 14.** Figure 14: Learning rate and epoch sweep of Qwen3-4B-Instruct-2507 on GSM8K dataset. [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: Learning rate and epoch sweep of Llama-3.2-3B-Instruct on MATH dataset. [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗

**Figure 16.** Figure 16: Learning rate and epoch sweep of Qwen3-4B-Instruct-2507 on MATH dataset. [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗

read the original abstract

Large language models (LLMs) are often used in environments where facts evolve, yet factual knowledge updates via fine-tuning on unstructured text often suffer from 1) reliance on compute-heavy paraphrasing augmentation and 2) the reversal curse. Recent studies show diffusion large language models (dLLMs) require fewer training samples to achieve lower loss in pre-training and are more resistant to the reversal curse, suggesting dLLMs may learn new knowledge more easily than autoregressive LLMs (arLLMs). We test this hypothesis in controlled knowledge fine-tuning experiments and find that while arLLMs rely on paraphrase augmentation to generalize knowledge text into question-answering (QA) capability, dLLMs do not require paraphrases to achieve high QA accuracy. To further investigate whether the demasking objective alone can induce such a knowledge injection advantage in dLLMs regardless of their diffusion denoising paradigm, we propose masked fine-tuning for arLLMs, which prompts an arLLM to reconstruct the original text given a masked version in context. The masked fine-tuning for arLLMs substantially improves the efficacy of knowledge injection, i.e. no paraphrase needed and resistant to the reversal curse, closing the gap between arLLMs and dLLMs. We also demonstrate broader applicability: on a large-scale knowledge-intensive dataset (1.2M samples), masked SFT achieves the best downstream accuracy on GPQA-diamond among all fine-tuning variants. The demasking objective also improves SFT on math tasks, suggesting broad utility beyond factual knowledge injection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Masked fine-tuning gives arLLMs a practical edge on knowledge injection and reversal-curse resistance, but the controls leave room for doubt on whether the demasking objective is the real driver.

read the letter

The main point is that they adapt a masking-based reconstruction objective to autoregressive LLMs during fine-tuning and report that it reduces the need for paraphrase augmentation while improving resistance to the reversal curse. This narrows the gap they observed between standard arLLMs and diffusion LLMs on factual knowledge updates. The approach also shows up well on a 1.2 million sample knowledge dataset, where masked SFT beats other fine-tuning variants on GPQA-diamond, and it gives gains on math tasks too. That last part is useful because it hints the trick is not limited to facts alone. The direct comparison between arLLMs and dLLMs on the same knowledge-injection setup is a clean way to frame the problem. The soft spots sit in the experimental isolation. The abstract claims clear accuracy lifts but skips error bars, exact masking ratios, and full ablation tables, so it is hard to judge how much of the gain traces to the demasking objective versus differences in scale, data distribution, or how the masked prompt is actually constructed. If the context prompt for arLLMs introduces signals that the dLLM baseline does not have, the QA and reversal-curse results could partly reflect that format difference rather than the objective itself. This paper is for groups that need cheaper ways to keep LLMs factually current without heavy augmentation. A reader working on fine-tuning efficiency or reversal-curse fixes will find the empirical comparisons worth looking at. It deserves a serious referee because the core idea is straightforward to test and the reported downstream numbers are concrete enough to check.

Referee Report

3 major / 2 minor

Summary. The paper claims that a masked fine-tuning objective applied to autoregressive LLMs (arLLMs), inspired by diffusion LLMs (dLLMs), enables effective knowledge injection without paraphrase augmentation and with resistance to the reversal curse. This closes the gap between arLLMs and dLLMs on QA tasks. The approach is further validated on a 1.2M-sample knowledge dataset where masked SFT achieves top accuracy on GPQA-diamond, and it also improves SFT on math tasks.

Significance. If the central empirical findings hold under tighter controls, the work would offer a practical, low-augmentation method for updating factual knowledge in standard arLLMs while mitigating the reversal curse. Demonstrating competitive performance with dLLMs via a simple demasking objective, plus gains on large-scale and math benchmarks, would be useful for maintaining current knowledge in deployed models.

major comments (3)

[§4] §4 (Knowledge Injection Experiments): The central claim that the demasking objective alone produces the observed QA accuracy gains and reversal-curse resistance requires explicit confirmation that model scale, pre-training corpus, and the precise masking/noising procedure (including context provision and prediction directionality) are matched between the masked arLLM fine-tuning and the dLLM baseline. Any mismatch in these factors could attribute gains to input format rather than the objective.
[Table 2] Table 2 / Figure 3 (QA and reversal results): The reported accuracy improvements lack error bars, multiple random seeds, or statistical tests. Without these, it is difficult to determine whether the gap closure between masked arLLMs and dLLMs is robust or sensitive to implementation details such as exact masking ratio.
[§3.2] §3.2 (Masked Fine-Tuning Prompt Design): The manuscript should specify whether the 'masked version in context' prompt introduces bidirectional signals or different masking ratios not present in the original dLLM training. If the prompt format differs materially, the no-paraphrase and reversal-resistance benefits may not isolate the demasking objective as claimed.

minor comments (2)

[Abstract] Abstract: Exact masking ratios and full ablation tables are referenced but not shown; adding a compact ablation summary would improve verifiability.
[§3] Notation: The distinction between 'masked fine-tuning' and standard SFT should be formalized with a short equation or pseudocode to avoid ambiguity in later sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications and commit to revisions that strengthen the empirical controls and reporting.

read point-by-point responses

Referee: [§4] §4 (Knowledge Injection Experiments): The central claim that the demasking objective alone produces the observed QA accuracy gains and reversal-curse resistance requires explicit confirmation that model scale, pre-training corpus, and the precise masking/noising procedure (including context provision and prediction directionality) are matched between the masked arLLM fine-tuning and the dLLM baseline. Any mismatch in these factors could attribute gains to input format rather than the objective.

Authors: We appreciate the need for explicit matching details. Experiments used models of matched scale (e.g., 7B-parameter variants from the same families) and standard pre-training corpora for each architecture. The masking procedure for arLLMs replicates dLLM noising with identical ratios and context provision; the sole controlled difference is the training objective (causal next-token prediction on masked spans vs. diffusion denoising). We will add a dedicated comparison table in §4 listing scale, corpus, masking ratio, context format, and directionality to make the controls fully transparent. revision: yes
Referee: [Table 2] Table 2 / Figure 3 (QA and reversal results): The reported accuracy improvements lack error bars, multiple random seeds, or statistical tests. Without these, it is difficult to determine whether the gap closure between masked arLLMs and dLLMs is robust or sensitive to implementation details such as exact masking ratio.

Authors: We agree that variability reporting is essential. The revised manuscript will include results averaged over five random seeds, with error bars on Table 2 and Figure 3, plus statistical significance tests (paired t-tests) against baselines. We are rerunning the relevant experiments to obtain these statistics. revision: yes
Referee: [§3.2] §3.2 (Masked Fine-Tuning Prompt Design): The manuscript should specify whether the 'masked version in context' prompt introduces bidirectional signals or different masking ratios not present in the original dLLM training. If the prompt format differs materially, the no-paraphrase and reversal-resistance benefits may not isolate the demasking objective as claimed.

Authors: The prompt supplies masked text as context and requires left-to-right autoregressive reconstruction of the original tokens; no bidirectional attention is added because the underlying model remains strictly causal. The masking ratio is fixed at 15 percent to match typical dLLM noising schedules. We will expand §3.2 with the exact prompt template, the chosen ratio, and an explicit statement that the autoregressive constraint is preserved, thereby isolating the demasking objective. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from direct comparisons

full rationale

The paper advances an empirical claim that masked fine-tuning improves knowledge injection in arLLMs, supported by QA accuracy measurements and reversal-curse tests on held-out data. These outcomes are obtained via controlled experiments rather than any derivation, equation, or parameter fit that reduces the reported metrics to the inputs by construction. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the method or results sections; the demasking objective is implemented and evaluated independently of the dLLM baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical hypothesis that diffusion-style demasking confers a knowledge-injection advantage that can be ported to autoregressive models via a simple masked reconstruction objective.

axioms (1)

domain assumption Recent studies show diffusion LLMs require fewer samples for lower loss and resist the reversal curse better than autoregressive LLMs.
This premise is invoked to motivate the hypothesis test and is taken from prior work rather than derived here.

pith-pipeline@v0.9.0 · 5822 in / 1226 out tokens · 51448 ms · 2026-05-18T07:18:49.407331+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse
cs.CL 2026-03 unverdicted novelty 6.0

Bidirectional objectives mitigate reversal by requiring explicit source-as-target signals and storing directions as distinct representations instead of inducing latent generalization.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

A is B" fail to learn

URLhttps: //openreview.net/forum?id=oDbiL9CLoS. Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Kor- bak, and Owain Evans. The reversal curse: Llms trained on" a is b" fail to learn" b is a".arXiv preprint arXiv:2309.12288,

work page arXiv
[2]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

URL http://arxiv.org/abs/1810.04805. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short p...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[5]

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al

URLhttps://zenodo.org/records/12608602. Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning llms on new knowledge encourage hallucinations? InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7765–7784,

work page arXiv 2024
[6]

Reverse training to nurse the reversal curse, 2024a

Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, and Sainbayar Sukhbaatar. Reverse training to nurse the reversal curse, 2024a. URLhttps://arxiv.org/abs/2403.13799. Olga Golovneva, Zeyuan Allen-Zhu, Jason E Weston, and Sainbayar Sukhbaatar. Reverse training to nurse the reversal curse. InFirst Conference on Language Modeling, 2024b. URLhttps: //openreview....

work page arXiv 2024
[7]

doi: 10.18653/v1/2024.findings-acl.680

Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.680. URLhttps://aclanthology.org/ 2024.findings-acl.680/. Thomas Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. Aging with grace: Lifelong model editing with discrete key-value adaptors,

work page doi:10.18653/v1/2024.findings-acl.680 2024
[8]

Houcheng Jiang, Junfeng Fang, Ningyu Zhang, Guojun Ma, Mingyang Wan, Xiang Wang, Xiangnan He, and Tat-seng Chua

URLhttps: //arxiv.org/abs/2211.11031. Houcheng Jiang, Junfeng Fang, Ningyu Zhang, Guojun Ma, Mingyang Wan, Xiang Wang, Xiangnan He, and Tat-seng Chua. Anyedit: Edit any knowledge encoded in language models.arXiv preprint arXiv:2502.05628,

work page arXiv
[9]

Scaling Laws for Neural Language Models

URLhttps://arxiv.org/abs/2001.08361. Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham M. Kakade, and Sitan Chen. Train for the worst, plan for the best: Understanding token ordering in masked diffusions. InForty-second Interna- tional Conference on Machine Learning,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[10]

Math-Verify: Math Verification Library

Hynek Kydlí ˇcek. Math-Verify: Math Verification Library. URLhttps://github.com/ huggingface/math-verify. Andrew K Lampinen, Arslan Chaudhry, Stephanie CY Chan, Cody Wild, Diane Wan, Alex Ku, Jörg Bornschein, Razvan Pascanu, Murray Shanahan, and James L McClelland. On the generalization of language models from in-context learning and finetuning: a control...

work page arXiv
[11]

Lost in the Middle: How Language Models Use Long Contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.arXiv preprint arXiv:2307.03172,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Rethinking the reversal curse of llms: a prescription from human knowl- edge reversal

Zhicong Lu, Li Jin, Peiguang Li, Yu Tian, Linhao Zhang, Sirui Wang, Guangluan Xu, Changyuan Tian, and Xunliang Cai. Rethinking the reversal curse of llms: a prescription from human knowl- edge reversal. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7518–7530,

work page 2024
[13]

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2023.URL https://arxiv. org/abs/2308.08747, 2308:60,

work page internal anchor Pith review arXiv 2023
[14]

An anal- ysis and mitigation of the reversal curse

Ang Lv, Kaiyi Zhang, Shufang Xie, Quan Tu, Yuhan Chen, Ji-Rong Wen, and Rui Yan. An anal- ysis and mitigation of the reversal curse. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 13603–13615, Miami, Florida, USA, November

work page 2024
[15]

doi: 10.18653/v1/2024.emnlp-main.754

Association for Computa- tional Linguistics. doi: 10.18653/v1/2024.emnlp-main.754. URLhttps://aclanthology. org/2024.emnlp-main.754/. Nick Mecklenburg, Yiyou Lin, Xiaoxiao Li, Daniel Holstein, Leonardo Nunes, Sara Malvar, Bruno Silva, Ranveer Chandra, Vijay Aski, Pavan Kumar Reddy Yannam, et al. Injecting new knowledge into large language models via super...

work page doi:10.18653/v1/2024.emnlp-main.754 2024
[16]

Large Language Diffusion Models

Notion Blog. Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up masked diffusion models on text. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.),International Conference on Representation Learning, volume 2025, pp. 82974–82997, 2025a. URLhttps://proceedings.iclr.cc/paper_files/paper/2025/file/...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Memorization and knowledge injec- tion in gated llms.arXiv preprint arXiv:2504.21239,

Xu Pan, Ely Hahami, Zechen Zhang, and Haim Sompolinsky. Memorization and knowledge injec- tion in gated llms.arXiv preprint arXiv:2504.21239,

work page arXiv
[18]

Diffu- sion beats autoregressive in data-constrained settings.arXiv preprint arXiv:2507.15857,

Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, and Deepak Pathak. Diffu- sion beats autoregressive in data-constrained settings.arXiv preprint arXiv:2507.15857,

work page arXiv
[19]

Accessed: 2025-11-18

URLhttps://pytorch.org/ docs/stable/profiler.html. Accessed: 2025-11-18. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67,

work page 2025
[20]

Analyzing and reducing catastrophic forgetting in parameter efficient tuning.arXiv preprint arXiv:2402.18865,

Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. Analyzing and reducing catastrophic forgetting in parameter efficient tuning.arXiv preprint arXiv:2402.18865,

work page arXiv
[21]

Fine tuning vs

Heydar Soudani, Evangelos Kanoulas, and Faegheh Hasibi. Fine tuning vs. retrieval augmented generation for less popular knowledge. InProceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pp. 12–22,

work page 2024
[22]

Xiao Wang, Yuansen Zhang, Tianze Chen, Songyang Gao, Senjie Jin, Xianjun Yang, Zhiheng Xi, Rui Zheng, Yicheng Zou, Tao Gui, et al

URLhttps://arxiv.org/abs/2405.14768. Xiao Wang, Yuansen Zhang, Tianze Chen, Songyang Gao, Senjie Jin, Xianjun Yang, Zhiheng Xi, Rui Zheng, Yicheng Zou, Tao Gui, et al. Trace: A comprehensive benchmark for continual learning in large language models.arXiv preprint arXiv:2310.06762,

work page arXiv
[23]

On the theoretical limitations of embedding-based retrieval

Orion Weller, Michael Boratko, Iftekhar Naim, and Jinhyuk Lee. On the theoretical limitations of embedding-based retrieval.arXiv preprint arXiv:2508.21038,

work page arXiv
[24]

Any-order gpt as masked diffusion model: Decoupling formulation and architecture.arXiv preprint arXiv:2506.19935,

Shuchen Xue, Tianyu Xie, Tianyang Hu, Zijin Feng, Jiacheng Sun, Kenji Kawaguchi, Zhenguo Li, and Zhi-Ming Ma. Any-order gpt as masked diffusion model: Decoupling formulation and architecture.arXiv preprint arXiv:2506.19935,

work page arXiv
[25]

Qwen3 Technical Report

13 Preprint An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Dream 7B: Diffusion Large Language Models

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma

URLhttps://arxiv.org/abs/2312.11795. Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the catastrophic forgetting in multimodal large language models. InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following,

work page arXiv 2023
[28]

From style to facts: Mapping the boundaries of knowledge injection with finetuning.arXiv preprint arXiv:2503.05919,

Eric Zhao, Pranjal Awasthi, and Nika Haghtalab. From style to facts: Mapping the boundaries of knowledge injection with finetuning.arXiv preprint arXiv:2503.05919,

work page arXiv
[29]

Datasets and experimental setups

A APPENDIX A.1 DATASET AND CODE AVAILABILITY The dataset and code base are available at:https://github.com/xup5/masked_arLLM. git A.2 LLMUSAGE The usage of LLM is limited to language polishing and literature search. We asked an LLM to suggest surface-level rewrites to improve clarity, grammar, and style for author-written passages. Edits were limited to p...

work page 2025
[30]

He represented the United States at the 2024 Summer Olympics in Paris, France, in the men's sabre and men's team sabre events in July

is an American right−handed sabre fencer. He represented the United States at the 2024 Summer Olympics in Paris, France, in the men's sabre and men's team sabre events in July

work page 2024
[31]

"" ND dataset Type

Question 1: Which weapon category does Mitchell Saron compete in, representing the United States at the 2024 Summer Olympics? Answer 1: Sabre Cue used in the question: [Mitchell Saron, United States, 2024 Summer Olympics] Question 2 (reverse question of question 1): Who represented the United States at the 2024 Summer Olympics to compete in the men's sabr...

work page 2024
[32]

Forward question:

He was brought into the world in Elk Grove, CA. He culminated his studies at Kansas State University. He concentrated his efforts toward EMT and Paramedic. He supported the operations at HP. He practiced his profession in Palo Alto, CA." Forward question:"What is the birth date of Curtis Chase Emley?" Answer:"May 28, 1952" Backward question:"Give me the f...

work page 1952
[33]

Change-order paraphrase:

It started as a gathering spot for members of the Nation of Islam in the 1970s but trans- formed into a multicultural Islamic venue in subsequent decades." Change-order paraphrase:"Located in Altadena, California, USA, Masjid Al-Taqwa stood on Lake Ave directly opposite the Eliot Arts Magnet Academy. Originally established in the 1970s as a historical Afr...

work page 2025
[34]

When evaluating the resulting models, we use the evaluation frameworkLM Evaluation Harnessand the default tasksgsm8kand hendrycks_math(Gao et al. (2024)). Specifically, we choose to use 0-shot and pass@1 with a maximum generation length of 256 at a temperature of

work page 2024
[35]

factorization curse

For GSM8K we report the accuracy using exact match withLM Evaluation Harness’s flexible extraction. For MATH we report the accuracy using exact match withmath-verifyextraction (Kydlí ˇcek). Both extraction methods are chosen to maximize alignment with human examination. Most of the training configurations are the same as those in the main experiments, exc...

work page 2024
[36]

Ground truth answer:

was born on March 25, 2024?" Ground truth answer:"Owen Carole Pavy" arLLM without paraphrases (step 1280, best checkpoint):"/n /n I am a licensed Dental Assistant in Phoenix, AZ. I was hired by a company that provides dental services. My birth is commemorated in March 25, 2024, because that was the year in which I completed my studies at a dental school i...

work page 2024
[37]

Wiki dataset Original text:

Her early moments were in Seattle, WA. She secured her degree from University of Minnesota. She set her academic sights on Dental Assistant. She was in charge of tasks at Avnet. She worked at a business located in Phoenix, AZ." Wiki dataset Original text:"The 2024-25 Men’s FIH Hockey Nations Cup was the third edition of the Men’s FIH Hockey Nations Cup, t...

work page 2024
[38]

Questions (Forward):

The defending champions New Zealand won their second title by defeating Pakistan 6-2 in the final. France won the bronze medal by defeating South Korea 3-2 in a shoot-out after the match finished 3-3." Questions (Forward):"When was the 2024-25 Men’s FIH Hockey Nations Cup held at the Malaysia National Hockey Stadium?" Ground truth answer:"15 to 21 June 20...

work page 2024
[39]

arLLM with same-order paraphrases (step 10240, best checkpoint):

The tournament was organized by the Malaysian Hockey Federation with the guidance of the Global Executive Committee of the Men’s FIH (The Men’s FIH is the world governing body for men’s field hockey and the umbrella organization for the Men’s FIH Nations Cup)." arLLM with same-order paraphrases (step 10240, best checkpoint): "/n /n The 2025 Men’s FIH Hock...

work page 2025
[40]

dLLM without paraphrases (step 80):

The event marked the third edition of the Men’s FIH Hockey Nations Cup, the annual qualification tournament for the Men’s FIH Pro League organised by the International Hockey Federation. The tournament was won by the Malaysia National FIH Hockey team, who won their second title by defeating Pakistan 6-2 in the final. France won the bronze medal by defeati...

work page 2024
[41]

dLLM with same-order paraphrases (step 5120, best checkpoint):

The match was played between Malaysia and the United Arab Emirates. Malaysia won the match 3-2." dLLM with same-order paraphrases (step 5120, best checkpoint): "/n /n The 2024-25 Men’s FIH Hockey Nations Cup was the third edition of the Men’s FIH Hockey Nations Cup, the annual qualification tournament for the Men’s FIH Pro League organised by the Internat...

work page 2024