{"work":{"id":"cce0f4b3-ed4d-4375-b84d-3f01316016c1","openalex_id":null,"doi":null,"arxiv_id":"2502.09992","raw_key":null,"title":"Large Language Diffusion Models","authors":null,"authors_text":"Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu","year":2025,"venue":"cs.CL","abstract":"The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings show the promise of diffusion models for language modeling at scale and challenge the common assumption that core LLM capabilities discussed above inherently depend on ARMs. 
Project page and codes: https://ml-gsai.github.io/LLaDA-demo/.","external_url":"https://arxiv.org/abs/2502.09992","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T13:33:28.283243+00:00","pith_arxiv_id":"2502.09992","created_at":"2026-05-09T06:20:37.177888+00:00","updated_at":"2026-06-29T13:33:28.283243+00:00","title_quality_ok":true,"display_title":"Large Language Diffusion Models","render_title":"Large Language Diffusion Models"},"hub":{"state":{"work_id":"cce0f4b3-ed4d-4375-b84d-3f01316016c1","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":133,"external_cited_by_count":null,"distinct_field_count":14,"first_pith_cited_at":"2025-05-21T17:59:05+00:00","last_pith_cited_at":"2026-06-25T03:32:12+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T15:58:59.999609+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":22},{"context_role":"method","n":5},{"context_role":"baseline","n":2}],"polarity_counts":[{"context_polarity":"background","n":21},{"context_polarity":"use_method","n":5},{"context_polarity":"baseline","n":2},{"context_polarity":"unclear","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Large Language Diffusion Models","claims":[{"claim_text":"The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. 
It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrate","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"one of two strategies: either leveraging autoregressive models to provide strong language modeling capabilities [27-32], or employing discrete diffusion-based approaches with limited language modeling capacity, which consequently leads to suboptimal performance [33, 34]. Encouragingly, recent advances in discrete diffusion models [25, 26, 35-43] have shown promising potential to overcome these limitations. In particular, LLaDA [42] has demonstrated performance competitive with LLaMA3-8B-Instru","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"6 Related Work Memory Technologies. As shown in Figure 1b, a wide range of memory technologies has emerged to address the growing capacity and bandwidth demands of modern AI accelerators [56], yet no single solution simultaneously provides high bandwidth, large capacity, and low power. HBM [20] is widely adopted in high-performance accelerators from NVIDIA V100 [34] to B200 [49] and upcoming Vera Rubin systems [35], while GDDR [19, 30] offers a cost-effective alternative. LPDDR [22, 29] targets ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Our findings underscore the critical urgency of addressing safety alignment in dVLMs. These results necessitate an immediate and fundamental re-evaluation of current defense paradigms to mitigate the unique risks of diffusion-based generation. Our code is available at: https://anonymous.4open.science/r/GPO-V-0250. 1 Introduction Diffusion Large Language Models (dLLMs) [1, 2] have emerged as a prominent generative paradigm parallel to traditional autoregressive models [3, 4]. 
Building upon this ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":", longer context and extra KV-cache. In contrast, LoRA introduces no extra inference overhead when serving a single task, since the low-rank updates can be merged into the model weights prior to deployment. Moreover, LoRA does not rely on the autoregressive prompting interface and is therefore applicable to non-autoregressive settings such as language-diffusion models [141]. Finally, prompt- and prefix-based methods typically do not directly adapt FFN layers, which are often conjectured to store f","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"On the other hand, neural network approaches have dramatically advanced lossy video compression over the past decade. The DCVC family [11, 17-20] refined conditional coding with progressively richer temporal context, with DCVC-HEM [18] being the first learned codec to surpass H.266/VVC in rate-distortion performance. Long-term temporal modelling [29] and generative latent coding [12, 30] have since pushed the frontier further. Yet the rate-distortion trade-off introduced in these designs i","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"4), we specify the different configurations of Cola DLM. In Section 4.5, for the scaling comparison, we independently train the autoregressive and LLaDA baselines under strictly matched settings. Specifically, the autoregressive and discrete diffusion models are randomly initialized using the official modeling implementations of LLaMA [92] and LLaDA [70], respectively. Details are provided in Appendix H.2. 
Metrics.As discussed in Section 5.1, the estimated perplexity exhibits a substantial mismat","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Large Language Diffusion Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (22 contexts).","role_counts":[{"n":22,"context_role":"background"},{"n":5,"context_role":"method"},{"n":2,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-20T03:32:06.534301+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"fd3c5ff4-e89d-47c2-8a36-52fc2428ec00","orcid":null,"display_name":"Shen Nie"},{"id":"66346050-907e-454e-b4a0-f6d564f3b48b","orcid":null,"display_name":"Fengqi Zhu"},{"id":"b0abae92-4b59-4d0a-9fd7-a73c6902d671","orcid":null,"display_name":"Zebin You"},{"id":"347028b6-63f9-4b43-ad5a-82f130ba5a23","orcid":null,"display_name":"Xiaolu Zhang"},{"id":"f5ed93e9-adf6-4331-a943-2907fd330613","orcid":null,"display_name":"Jingyang Ou"},{"id":"9267dd31-fde5-4faa-85b0-fd93a51f38a6","orcid":null,"display_name":"Jun Hu"}]},"error":null,"updated_at":"2026-05-20T03:32:07.894478+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T11:49:56.757597+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Dream 7B: Diffusion Large Language Models","work_id":"a8a49dbd-ad10-4c79-b1aa-3ad5173887ad","shared_citers":27},{"title":"Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution","work_id":"fcc1dcd6-aa26-420e-86d2-dc87b127ddd5","shared_citers":17},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":17},{"title":"Llada2.0: Scaling up diffusion language models to 
100b","work_id":"a1b1080d-0a91-44a4-8f70-2bf3e7a27e0b","shared_citers":14},{"title":"Llada 1.5: Variance-reduced preference optimization for large language diffusion models","work_id":"ebe72b3e-b18c-4784-8c3d-d7bfda67e098","shared_citers":13},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":13},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":12},{"title":"Fast-dllm: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding.CoRR, abs/2505.22618","work_id":"9f6c2a70-9830-48ae-b181-6b5b1cbfae97","shared_citers":11},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":11},{"title":"Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov","work_id":"b34ab928-6ffb-4028-b13c-395a8924d76b","shared_citers":10},{"title":"Seed diffusion: A large-scale diffusion language model with high-speed inference","work_id":"7412f5f3-8e71-41c1-9c69-d4ca250b18fa","shared_citers":9},{"title":"Classifier-Free Diffusion Guidance","work_id":"acf2c588-c088-4a6c-938e-150ad7c666d7","shared_citers":8},{"title":"Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809","work_id":"9d626cf3-094e-4960-9e71-a00a47158639","shared_citers":8},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":8},{"title":"Continuous diffusion for categor- ical data","work_id":"c0904b65-a618-46bd-85ef-53635f43ea5c","shared_citers":7},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":7},{"title":"Llada-v: Large language diffusion models with visual instruction tuning","work_id":"0cc20892-a1cb-4674-af31-8b884e2a3a79","shared_citers":7},{"title":"Mercury: Ultra-fast language models based on 
diffusion","work_id":"09638e55-9958-4407-94da-0a6fbc082ebc","shared_citers":7},{"title":"Scaling diffusion language models via adaptation from autoregressive models","work_id":"48644013-438b-4fbc-a954-11e2c6f91808","shared_citers":7},{"title":"The diffusion duality","work_id":"4d71092a-cda4-4bad-b660-5932ea447f30","shared_citers":7},{"title":"d1: Scaling reasoning in diffusion large language models via reinforcement learning","work_id":"570ed81c-2193-43d1-8537-c6bdb7cd8112","shared_citers":6},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":6},{"title":"Diffuseq: Se- quence to sequence text generation with diffusion models.arXiv preprint arXiv:2210.08933","work_id":"180ac1b2-1f1e-4b77-9bf9-7a16f0167e1b","shared_citers":6},{"title":"DKV-Cache: The cache for diffusion language models","work_id":"8bba2afd-bafd-4ccb-af6d-58855e9f3967","shared_citers":6}],"time_series":[{"n":57,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T11:49:50.300710+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T11:49:50.272284+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Large Language Diffusion Models","claims":[{"claim_text":"The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. 
LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrate","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"one of two strategies: either leveraging autoregressive models to provide strong language modeling capabilities [27-32], or employing discrete diffusion-based approaches with limited language modeling capacity, which consequently leads to suboptimal performance [33, 34]. Encouragingly, recent advances in discrete diffusion models [25, 26, 35-43] have shown promising potential to overcome these limitations. In particular, LLaDA [42] has demonstrated performance competitive with LLaMA3-8B-Instru","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"6 Related Work Memory Technologies. As shown in Figure 1b, a wide range of memory technologies has emerged to address the growing capacity and bandwidth demands of modern AI accelerators [56], yet no single solution simultaneously provides high bandwidth, large capacity, and low power. HBM [20] is widely adopted in high-performance accelerators from NVIDIA V100 [34] to B200 [49] and upcoming Vera Rubin systems [35], while GDDR [19, 30] offers a cost-effective alternative. LPDDR [22, 29] targets ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Our findings underscore the critical urgency of addressing safety alignment in dVLMs. These results necessitate an immediate and fundamental re-evaluation of current defense paradigms to mitigate the unique risks of diffusion-based generation. Our code is available at: https://anonymous.4open.science/r/GPO-V-0250. 
1 Introduction Diffusion Large Language Models (dLLMs) [1, 2] have emerged as a prominent generative paradigm parallel to traditional autoregressive models [3, 4]. Building upon this ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":", longer context and extra KV-cache. In contrast, LoRA introduces no extra inference overhead when serving a single task, since the low-rank updates can be merged into the model weights prior to deployment. Moreover, LoRA does not rely on the autoregressive prompting interface and is therefore applicable to non-autoregressive settings such as language-diffusion models [141]. Finally, prompt- and prefix-based methods typically do not directly adapt FFN layers, which are often conjectured to store f","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"On the other hand, neural network approaches have dramatically advanced lossy video compression over the past decade. The DCVC family [11, 17-20] refined conditional coding with progressively richer temporal context, with DCVC-HEM [18] being the first learned codec to surpass H.266/VVC in rate-distortion performance. Long-term temporal modelling [29] and generative latent coding [12, 30] have since pushed the frontier further. Yet the rate-distortion trade-off introduced in these designs i","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"4), we specify the different configurations of Cola DLM. In Section 4.5, for the scaling comparison, we independently train the autoregressive and LLaDA baselines under strictly matched settings. Specifically, the autoregressive and discrete diffusion models are randomly initialized using the official modeling implementations of LLaMA [92] and LLaDA [70], respectively. Details are provided in Appendix H.2. 
Metrics.As discussed in Section 5.1, the estimated perplexity exhibits a substantial mismat","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Large Language Diffusion Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (22 contexts).","role_counts":[{"n":22,"context_role":"background"},{"n":5,"context_role":"method"},{"n":2,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-20T03:32:06.528480+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Large Language Diffusion Models","claims":[{"claim_text":"The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrate","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Large Language Diffusion Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T11:49:56.760683+00:00"}},"summary":{"title":"Large Language Diffusion Models","claims":[{"claim_text":"The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. 
LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrate","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Large Language Diffusion Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Dream 7B: Diffusion Large Language Models","work_id":"a8a49dbd-ad10-4c79-b1aa-3ad5173887ad","shared_citers":27},{"title":"Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution","work_id":"fcc1dcd6-aa26-420e-86d2-dc87b127ddd5","shared_citers":17},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":17},{"title":"Llada2.0: Scaling up diffusion language models to 100b","work_id":"a1b1080d-0a91-44a4-8f70-2bf3e7a27e0b","shared_citers":14},{"title":"Llada 1.5: Variance-reduced preference optimization for large language diffusion models","work_id":"ebe72b3e-b18c-4784-8c3d-d7bfda67e098","shared_citers":13},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":13},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":12},{"title":"Fast-dllm: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding.CoRR, abs/2505.22618","work_id":"9f6c2a70-9830-48ae-b181-6b5b1cbfae97","shared_citers":11},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":11},{"title":"Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr 
Kuleshov","work_id":"b34ab928-6ffb-4028-b13c-395a8924d76b","shared_citers":10},{"title":"Seed diffusion: A large-scale diffusion language model with high-speed inference","work_id":"7412f5f3-8e71-41c1-9c69-d4ca250b18fa","shared_citers":9},{"title":"Classifier-Free Diffusion Guidance","work_id":"acf2c588-c088-4a6c-938e-150ad7c666d7","shared_citers":8},{"title":"Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809","work_id":"9d626cf3-094e-4960-9e71-a00a47158639","shared_citers":8},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":8},{"title":"Continuous diffusion for categor- ical data","work_id":"c0904b65-a618-46bd-85ef-53635f43ea5c","shared_citers":7},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":7},{"title":"Llada-v: Large language diffusion models with visual instruction tuning","work_id":"0cc20892-a1cb-4674-af31-8b884e2a3a79","shared_citers":7},{"title":"Mercury: Ultra-fast language models based on diffusion","work_id":"09638e55-9958-4407-94da-0a6fbc082ebc","shared_citers":7},{"title":"Scaling diffusion language models via adaptation from autoregressive models","work_id":"48644013-438b-4fbc-a954-11e2c6f91808","shared_citers":7},{"title":"The diffusion duality","work_id":"4d71092a-cda4-4bad-b660-5932ea447f30","shared_citers":7},{"title":"d1: Scaling reasoning in diffusion large language models via reinforcement learning","work_id":"570ed81c-2193-43d1-8537-c6bdb7cd8112","shared_citers":6},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":6},{"title":"Diffuseq: Se- quence to sequence text generation with diffusion models.arXiv preprint arXiv:2210.08933","work_id":"180ac1b2-1f1e-4b77-9bf9-7a16f0167e1b","shared_citers":6},{"title":"DKV-Cache: The cache for diffusion language 
models","work_id":"8bba2afd-bafd-4ccb-af6d-58855e9f3967","shared_citers":6}],"time_series":[{"n":57,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"66346050-907e-454e-b4a0-f6d564f3b48b","orcid":null,"display_name":"Fengqi Zhu","source":"manual","import_confidence":0.72},{"id":"f5ed93e9-adf6-4331-a943-2907fd330613","orcid":null,"display_name":"Jingyang Ou","source":"manual","import_confidence":0.72},{"id":"9267dd31-fde5-4faa-85b0-fd93a51f38a6","orcid":null,"display_name":"Jun Hu","source":"manual","import_confidence":0.72},{"id":"fd3c5ff4-e89d-47c2-8a36-52fc2428ec00","orcid":null,"display_name":"Shen Nie","source":"manual","import_confidence":0.72},{"id":"347028b6-63f9-4b43-ad5a-82f130ba5a23","orcid":null,"display_name":"Xiaolu Zhang","source":"manual","import_confidence":0.72},{"id":"b0abae92-4b59-4d0a-9fd7-a73c6902d671","orcid":null,"display_name":"Zebin You","source":"manual","import_confidence":0.72}]}}