{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:GV7LXWOKGBGVM7HK5NHM7HGVSB","short_pith_number":"pith:GV7LXWOK","schema_version":"1.0","canonical_sha256":"357ebbd9ca304d567ceaeb4ecf9cd5904104ecc1e84d6cea35936d42def7cb39","source":{"kind":"arxiv","id":"2512.09675","version":3},"attestation_state":"computed","paper":{"title":"d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Tree-structured rollouts with verifiable rewards and scheduled self-distillation deliver reliable step-wise advantages for diffusion language models.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Aiwei Liu, Bolin Ding, Leyi Pan, Liancheng Fang, Lijie Wen, Lingzhe Zhang, Minghua He, Shuchang Tao, Yunpeng Zhai, Zhaoyang Liu, Zheyu Fu","submitted_at":"2025-12-10T14:20:07Z","abstract_excerpt":"Reinforcement learning (RL) is pivotal for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, existing dLLM policy optimization methods suffer from two critical reliability bottlenecks: (1) reward sparsity, arising from coarse or unverifiable signals that impede accurate advantage calculation; and (2) their probability estimates do not account for the gap to the unbiased expectation over all decoding orders, which are intractable to compute. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured ro"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2512.09675","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2025-12-10T14:20:07Z","cross_cats_sorted":[],"title_canon_sha256":"5673ae3477bc9076392b812683e6a943c3cb2cabb2cd24fcbcff524a05dd645d","abstract_canon_sha256":"c4bd192b984bf8a15d602a98877b8dd1a914d13181161bed1b9dd5c6aa836b7e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T03:09:32.809700Z","signature_b64":"J65+NGBbVYdksflCTYfqomE6hcL+7Ar1rl1GNyCjeF1Um/xNIh9kxVTAK9q+Q8F3is3m+eAT75DKZuxx/ZJHDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"357ebbd9ca304d567ceaeb4ecf9cd5904104ecc1e84d6cea35936d42def7cb39","last_reissued_at":"2026-05-18T03:09:32.809039Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T03:09:32.809039Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Tree-structured rollouts with verifiable rewards and scheduled self-distillation deliver reliable step-wise advantages for diffusion language models.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Aiwei Liu, Bolin Ding, Leyi Pan, Liancheng Fang, Lijie Wen, Lingzhe Zhang, Minghua He, Shuchang Tao, Yunpeng Zhai, Zhaoyang Liu, Zheyu Fu","submitted_at":"2025-12-10T14:20:07Z","abstract_excerpt":"Reinforcement learning (RL) is pivotal for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, existing dLLM policy optimization methods suffer from two critical reliability bottlenecks: (1) reward sparsity, arising from coarse or unverifiable signals that impede accurate advantage calculation; and (2) their probability estimates do not account for the gap to the unbiased expectation over all decoding orders, which are intractable to compute. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured ro"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experiments demonstrate that d-TreeRPO outperforms existing baselines and achieves significant improvements across multiple reasoning benchmarks. Specifically, it achieves +86.2% on Sudoku, +51.6% on Countdown, +4.5% on GSM8K, and +5.3% on Math500 compared to the base model.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that tree-structured rollouts based on verifiable outcome rewards can be computed efficiently while still providing unbiased fine-grained step-wise advantage estimates that generalize beyond the sampled trees.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"d-TreeRPO uses tree rollouts for fine-grained verifiable rewards and time-scheduled self-distillation to reduce probability estimation gaps in diffusion LLMs, delivering substantial gains on Sudoku, Countdown, GSM8K, and Math500 benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Tree-structured rollouts with verifiable rewards and scheduled self-distillation deliver reliable step-wise advantages for diffusion language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a68458da126562a98d74d05a1d1133d231f6a47beac83bdd785292ee01e9187f"},"source":{"id":"2512.09675","kind":"arxiv","version":3},"verdict":{"id":"1da519d0-f60b-4d58-bc1f-8c774c63d039","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T23:19:28.436792Z","strongest_claim":"Experiments demonstrate that d-TreeRPO outperforms existing baselines and achieves significant improvements across multiple reasoning benchmarks. Specifically, it achieves +86.2% on Sudoku, +51.6% on Countdown, +4.5% on GSM8K, and +5.3% on Math500 compared to the base model.","one_line_summary":"d-TreeRPO uses tree rollouts for fine-grained verifiable rewards and time-scheduled self-distillation to reduce probability estimation gaps in diffusion LLMs, delivering substantial gains on Sudoku, Countdown, GSM8K, and Math500 benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that tree-structured rollouts based on verifiable outcome rewards can be computed efficiently while still providing unbiased fine-grained step-wise advantage estimates that generalize beyond the sampled trees.","pith_extraction_headline":"Tree-structured rollouts with verifiable rewards and scheduled self-distillation deliver reliable step-wise advantages for diffusion language models."},"references":{"count":18,"sample":[{"doi":"","year":2025,"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","ref_index":1,"cited_arxiv_id":"2110.14168","is_internal_anchor":true},{"doi":"","year":2025,"title":"Let's Verify Step by Step","work_id":"6d05b790-04c5-4fd2-91b2-ba1dfdd5770f","ref_index":2,"cited_arxiv_id":"2305.20050","is_internal_anchor":true},{"doi":"","year":2025,"title":"Scaling up masked diffusion models on text","work_id":"18872de0-5f47-4650-ba4f-f48cab3bfc7e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"arXiv preprint arXiv:2510.08554 , year=","work_id":"92530ab8-ecd9-4f9c-9a0e-4a3564931b48","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Dream 7B: Diffusion Large Language Models","work_id":"a8a49dbd-ad10-4c79-b1aa-3ad5173887ad","ref_index":5,"cited_arxiv_id":"2508.15487","is_internal_anchor":true}],"resolved_work":18,"snapshot_sha256":"804cc3048d34890bce4bcce7d8aa4b057ae79218c2efc3b06bf7d6db1b64c894","internal_anchors":3},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2512.09675","created_at":"2026-05-18T03:09:32.809159+00:00"},{"alias_kind":"arxiv_version","alias_value":"2512.09675v3","created_at":"2026-05-18T03:09:32.809159+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2512.09675","created_at":"2026-05-18T03:09:32.809159+00:00"},{"alias_kind":"pith_short_12","alias_value":"GV7LXWOKGBGV","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"GV7LXWOKGBGVM7HK","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"GV7LXWOK","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":7,"internal_anchor_count":7,"sample":[{"citing_arxiv_id":"2605.16842","citing_title":"Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08302","citing_title":"DMax: Aggressive Parallel Decoding for dLLMs","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15412","citing_title":"From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10218","citing_title":"Relative Score Policy Optimization for Diffusion Language Models","ref_index":83,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11094","citing_title":"E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08302","citing_title":"DMax: Aggressive Parallel Decoding for dLLMs","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04431","citing_title":"Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning","ref_index":5,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/GV7LXWOKGBGVM7HK5NHM7HGVSB","json":"https://pith.science/pith/GV7LXWOKGBGVM7HK5NHM7HGVSB.json","graph_json":"https://pith.science/api/pith-number/GV7LXWOKGBGVM7HK5NHM7HGVSB/graph.json","events_json":"https://pith.science/api/pith-number/GV7LXWOKGBGVM7HK5NHM7HGVSB/events.json","paper":"https://pith.science/paper/GV7LXWOK"},"agent_actions":{"view_html":"https://pith.science/pith/GV7LXWOKGBGVM7HK5NHM7HGVSB","download_json":"https://pith.science/pith/GV7LXWOKGBGVM7HK5NHM7HGVSB.json","view_paper":"https://pith.science/paper/GV7LXWOK","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2512.09675&json=true","fetch_graph":"https://pith.science/api/pith-number/GV7LXWOKGBGVM7HK5NHM7HGVSB/graph.json","fetch_events":"https://pith.science/api/pith-number/GV7LXWOKGBGVM7HK5NHM7HGVSB/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/GV7LXWOKGBGVM7HK5NHM7HGVSB/action/timestamp_anchor","attest_storage":"https://pith.science/pith/GV7LXWOKGBGVM7HK5NHM7HGVSB/action/storage_attestation","attest_author":"https://pith.science/pith/GV7LXWOKGBGVM7HK5NHM7HGVSB/action/author_attestation","sign_citation":"https://pith.science/pith/GV7LXWOKGBGVM7HK5NHM7HGVSB/action/citation_signature","submit_replication":"https://pith.science/pith/GV7LXWOKGBGVM7HK5NHM7HGVSB/action/replication_record"}},"created_at":"2026-05-18T03:09:32.809159+00:00","updated_at":"2026-05-18T03:09:32.809159+00:00"}