{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:36DOMX76H2RUANZKM4FBMUQPNQ","short_pith_number":"pith:36DOMX76","schema_version":"1.0","canonical_sha256":"df86e65ffe3ea340372a670a16520f6c20e6da4f5d44d77bc018f94f16709442","source":{"kind":"arxiv","id":"2512.15745","version":2},"attestation_state":"computed","paper":{"title":"LLaDA2.0: Scaling Up Diffusion Language Models to 100B","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"LLaDA2.0 converts pre-trained auto-regressive LLMs into discrete diffusion models at 100B scale using a three-phase block-level training scheme.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Chengxi Li, Chongxuan Li, Da Zheng, Guoshan Lu, Huabin Liu, Jianfeng Tan, Jianguo Li, Jiaqi Hu, Ji-Rong Wen, Junbo Zhao, Junlin Zhou, Jun Zhou, Kun Chen, Lanning Wei, Lin Liu, Liwang Zhu, Lun Du, Maosong Cao, Mingliang Gong, Tiwei Bie, Xiaocheng Lu, Xiaolu Zhang, Yanmei Gu, Yihong Zhuang, Yipeng Xing, Yuxin Ma, Zehuan Li, Zenan Huang, Zhanchao Zhou, Zhenzhong Lan, Zhuochen Gong","submitted_at":"2025-12-10T09:26:18Z","abstract_excerpt":"This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLM) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds knowledge inheritance, progressive adaption and efficiency-aware design principle, and seamless converts a pre-trained AR model into dLLM with a novel 3-phase block-level WSD based training scheme: progressive increasing block-size in block diffusion (warm-up), large-scale full-sequence di"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2512.15745","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2025-12-10T09:26:18Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"7111e055cdf6f5466d217444139bf7dd16339070ce0b798b52a7d85a3ac7c50b","abstract_canon_sha256":"b5599606fe0297b25d755d01b81235c4ce8d24568647ceeb70fe01372922fac2"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:22.140199Z","signature_b64":"qk1pTIcZoL9Z7kIQqiMX17q3M/jymcX1Huu1E3eRn3d6lDtnGnEB9F7Z/r+V4BT1OA0+v60onSwBhCsFp+liDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"df86e65ffe3ea340372a670a16520f6c20e6da4f5d44d77bc018f94f16709442","last_reissued_at":"2026-05-17T23:39:22.139494Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:22.139494Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"LLaDA2.0: Scaling Up Diffusion Language Models to 100B","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"LLaDA2.0 converts pre-trained auto-regressive LLMs into discrete diffusion models at 100B scale using a three-phase block-level training scheme.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Chengxi Li, Chongxuan Li, Da Zheng, Guoshan Lu, Huabin Liu, Jianfeng Tan, Jianguo Li, Jiaqi Hu, Ji-Rong Wen, Junbo Zhao, Junlin Zhou, Jun Zhou, Kun Chen, Lanning Wei, Lin Liu, Liwang Zhu, Lun Du, Maosong Cao, Mingliang Gong, Tiwei Bie, Xiaocheng Lu, Xiaolu Zhang, Yanmei Gu, Yihong Zhuang, Yipeng Xing, Yuxin Ma, Zehuan Li, Zenan Huang, Zhanchao Zhou, Zhenzhong Lan, Zhuochen Gong","submitted_at":"2025-12-10T09:26:18Z","abstract_excerpt":"This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLM) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds knowledge inheritance, progressive adaption and efficiency-aware design principle, and seamless converts a pre-trained AR model into dLLM with a novel 3-phase block-level WSD based training scheme: progressive increasing block-size in block diffusion (warm-up), large-scale full-sequence di"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"LLaDA2.0 establishes a new paradigm for frontier-scale deployment of discrete diffusion LLMs by systematic conversion from AR models through a novel 3-phase block-level WSD training scheme, delivering superior performance and efficiency at 100B scale.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 3-phase progressive block-size WSD training scheme successfully transfers knowledge from the original AR model while preserving parallel decoding advantages without introducing performance degradation at 100B scale.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LLaDA2.0 converts pre-trained auto-regressive LLMs into discrete diffusion models at 100B scale using a three-phase block-level training scheme.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f22a2c71e4b712571606320124953b20c111a00071d37b235b9bba1d22119bdc"},"source":{"id":"2512.15745","kind":"arxiv","version":2},"verdict":{"id":"3b1b5ace-efa7-454b-895d-b1777af043fc","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T18:48:18.539049Z","strongest_claim":"LLaDA2.0 establishes a new paradigm for frontier-scale deployment of discrete diffusion LLMs by systematic conversion from AR models through a novel 3-phase block-level WSD training scheme, delivering superior performance and efficiency at 100B scale.","one_line_summary":"LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 3-phase progressive block-size WSD training scheme successfully transfers knowledge from the original AR model while preserving parallel decoding advantages without introducing performance degradation at 100B scale.","pith_extraction_headline":"LLaDA2.0 converts pre-trained auto-regressive LLMs into discrete diffusion models at 100B scale using a three-phase block-level training scheme."},"references":{"count":43,"sample":[{"doi":"","year":null,"title":"Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models","work_id":"b34ab928-6ffb-4028-b13c-395a8924d76b","ref_index":1,"cited_arxiv_id":"2503.09573","is_internal_anchor":true},{"doi":"","year":null,"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","ref_index":2,"cited_arxiv_id":"2108.07732","is_internal_anchor":true},{"doi":"","year":null,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","ref_index":3,"cited_arxiv_id":"2107.03374","is_internal_anchor":true},{"doi":"","year":null,"title":"Dpad: Efficient diffusion language models with suffix dropout","work_id":"3f0e3292-b812-4d35-9383-8e3959725c6b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","ref_index":5,"cited_arxiv_id":"1803.05457","is_internal_anchor":true}],"resolved_work":43,"snapshot_sha256":"969d81ae391b454d45b28ddf9305ec05d2bd691f8a1432a965c4e62c02bb871d","internal_anchors":25},"formal_canon":{"evidence_count":2,"snapshot_sha256":"696fb584b1639e7410e0f7a5ae35ed881deae00abac2a1da7bc8fac10abb5030"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2512.15745","created_at":"2026-05-17T23:39:22.139617+00:00"},{"alias_kind":"arxiv_version","alias_value":"2512.15745v2","created_at":"2026-05-17T23:39:22.139617+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2512.15745","created_at":"2026-05-17T23:39:22.139617+00:00"},{"alias_kind":"pith_short_12","alias_value":"36DOMX76H2RU","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"36DOMX76H2RUANZK","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"36DOMX76","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":34,"internal_anchor_count":34,"sample":[{"citing_arxiv_id":"2603.20216","citing_title":"Locally Coherent Parallel Decoding in Diffusion Language Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09450","citing_title":"ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20813","citing_title":"PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18165","citing_title":"Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20179","citing_title":"TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08302","citing_title":"DMax: Aggressive Parallel Decoding for dLLMs","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2603.07475","citing_title":"A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14305","citing_title":"Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14465","citing_title":"From Table to Cell: Attention for Better Reasoning with TABALIGN","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11726","citing_title":"Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12522","citing_title":"Differences in Text Generated by Diffusion and Autoregressive Language Models","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13026","citing_title":"Understanding and Accelerating the Training of Masked Diffusion Language Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13382","citing_title":"BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02560","citing_title":"Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11726","citing_title":"Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10980","citing_title":"LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11577","citing_title":"BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06779","citing_title":"VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04647","citing_title":"ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving","ref_index":90,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10218","citing_title":"Relative Score Policy Optimization for Diffusion Language Models","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09291","citing_title":"dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models","ref_index":146,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10518","citing_title":"Infinite Mask Diffusion for Few-Step Distillation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09397","citing_title":"BadDLM: Backdooring Diffusion Language Models with Diverse Targets","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09536","citing_title":"TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10020","citing_title":"TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation","ref_index":4,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/36DOMX76H2RUANZKM4FBMUQPNQ","json":"https://pith.science/pith/36DOMX76H2RUANZKM4FBMUQPNQ.json","graph_json":"https://pith.science/api/pith-number/36DOMX76H2RUANZKM4FBMUQPNQ/graph.json","events_json":"https://pith.science/api/pith-number/36DOMX76H2RUANZKM4FBMUQPNQ/events.json","paper":"https://pith.science/paper/36DOMX76"},"agent_actions":{"view_html":"https://pith.science/pith/36DOMX76H2RUANZKM4FBMUQPNQ","download_json":"https://pith.science/pith/36DOMX76H2RUANZKM4FBMUQPNQ.json","view_paper":"https://pith.science/paper/36DOMX76","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2512.15745&json=true","fetch_graph":"https://pith.science/api/pith-number/36DOMX76H2RUANZKM4FBMUQPNQ/graph.json","fetch_events":"https://pith.science/api/pith-number/36DOMX76H2RUANZKM4FBMUQPNQ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/36DOMX76H2RUANZKM4FBMUQPNQ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/36DOMX76H2RUANZKM4FBMUQPNQ/action/storage_attestation","attest_author":"https://pith.science/pith/36DOMX76H2RUANZKM4FBMUQPNQ/action/author_attestation","sign_citation":"https://pith.science/pith/36DOMX76H2RUANZKM4FBMUQPNQ/action/citation_signature","submit_replication":"https://pith.science/pith/36DOMX76H2RUANZKM4FBMUQPNQ/action/replication_record"}},"created_at":"2026-05-17T23:39:22.139617+00:00","updated_at":"2026-05-17T23:39:22.139617+00:00"}