{"paper":{"title":"d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Tree-structured rollouts with verifiable rewards and scheduled self-distillation deliver reliable step-wise advantages for diffusion language models.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Aiwei Liu, Bolin Ding, Leyi Pan, Liancheng Fang, Lijie Wen, Lingzhe Zhang, Minghua He, Shuchang Tao, Yunpeng Zhai, Zhaoyang Liu, Zheyu Fu","submitted_at":"2025-12-10T14:20:07Z","abstract_excerpt":"Reinforcement learning (RL) is pivotal for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, existing dLLM policy optimization methods suffer from two critical reliability bottlenecks: (1) reward sparsity, arising from coarse or unverifiable signals that impede accurate advantage calculation; and (2) their probability estimates do not account for the gap to the unbiased expectation over all decoding orders, which are intractable to compute. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured ro"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experiments demonstrate that d-TreeRPO outperforms existing baselines and achieves significant improvements across multiple reasoning benchmarks. Specifically, it achieves +86.2% on Sudoku, +51.6% on Countdown, +4.5% on GSM8K, and +5.3% on Math500 compared to the base model.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that tree-structured rollouts based on verifiable outcome rewards can be computed efficiently while still providing unbiased fine-grained step-wise advantage estimates that generalize beyond the sampled trees.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"d-TreeRPO uses tree rollouts for fine-grained verifiable rewards and time-scheduled self-distillation to reduce probability estimation gaps in diffusion LLMs, delivering substantial gains on Sudoku, Countdown, GSM8K, and Math500 benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Tree-structured rollouts with verifiable rewards and scheduled self-distillation deliver reliable step-wise advantages for diffusion language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a68458da126562a98d74d05a1d1133d231f6a47beac83bdd785292ee01e9187f"},"source":{"id":"2512.09675","kind":"arxiv","version":3},"verdict":{"id":"1da519d0-f60b-4d58-bc1f-8c774c63d039","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T23:19:28.436792Z","strongest_claim":"Experiments demonstrate that d-TreeRPO outperforms existing baselines and achieves significant improvements across multiple reasoning benchmarks. Specifically, it achieves +86.2% on Sudoku, +51.6% on Countdown, +4.5% on GSM8K, and +5.3% on Math500 compared to the base model.","one_line_summary":"d-TreeRPO uses tree rollouts for fine-grained verifiable rewards and time-scheduled self-distillation to reduce probability estimation gaps in diffusion LLMs, delivering substantial gains on Sudoku, Countdown, GSM8K, and Math500 benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that tree-structured rollouts based on verifiable outcome rewards can be computed efficiently while still providing unbiased fine-grained step-wise advantage estimates that generalize beyond the sampled trees.","pith_extraction_headline":"Tree-structured rollouts with verifiable rewards and scheduled self-distillation deliver reliable step-wise advantages for diffusion language models."},"references":{"count":18,"sample":[{"doi":"","year":2025,"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","ref_index":1,"cited_arxiv_id":"2110.14168","is_internal_anchor":true},{"doi":"","year":2025,"title":"Let's Verify Step by Step","work_id":"6d05b790-04c5-4fd2-91b2-ba1dfdd5770f","ref_index":2,"cited_arxiv_id":"2305.20050","is_internal_anchor":true},{"doi":"","year":2025,"title":"Scaling up masked diffusion models on text","work_id":"18872de0-5f47-4650-ba4f-f48cab3bfc7e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"arXiv preprint arXiv:2510.08554 , year=","work_id":"92530ab8-ecd9-4f9c-9a0e-4a3564931b48","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Dream 7B: Diffusion Large Language Models","work_id":"a8a49dbd-ad10-4c79-b1aa-3ad5173887ad","ref_index":5,"cited_arxiv_id":"2508.15487","is_internal_anchor":true}],"resolved_work":18,"snapshot_sha256":"804cc3048d34890bce4bcce7d8aa4b057ae79218c2efc3b06bf7d6db1b64c894","internal_anchors":3},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}