Since $p^{\text{true}}_d$ is the expectation of $q_{\tau(d,\sigma)}(\sigma)$ over $\sigma \sim Q$, reducing the typical-path deviation (smaller $\epsilon_{d,\delta}$) makes $\hat{p}_d$ a more reliable proxy when approximating $p^{\text{true}}_d$.
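The reasoning in this passage can be made explicit with a short worked bound. This is only a sketch: the passage does not define $\epsilon_{d,\delta}$, so the derivation assumes one plausible reading, namely that with probability at least $1-\delta$ over $\sigma \sim Q$ the path value $q_{\tau(d,\sigma)}(\sigma)$ lies within $\epsilon_{d,\delta}$ of $\hat{p}_d$. It also assumes that both quantities are probabilities in $[0,1]$. The citing paper's actual definition may differ.

```latex
% Assumption (not stated in the quoted passage): with probability at least
% 1 - \delta over \sigma \sim Q,
%   | q_{\tau(d,\sigma)}(\sigma) - \hat{p}_d | \le \epsilon_{d,\delta},
% and all quantities lie in [0, 1], so the deviation is at most 1 otherwise.
\begin{align*}
p^{\text{true}}_d
  &= \mathbb{E}_{\sigma \sim Q}\big[\, q_{\tau(d,\sigma)}(\sigma) \,\big], \\
\big|\, p^{\text{true}}_d - \hat{p}_d \,\big|
  &= \big|\, \mathbb{E}_{\sigma \sim Q}\big[\, q_{\tau(d,\sigma)}(\sigma) - \hat{p}_d \,\big] \,\big| \\
  &\le \mathbb{E}_{\sigma \sim Q}\,\big|\, q_{\tau(d,\sigma)}(\sigma) - \hat{p}_d \,\big| \\
  &\le \epsilon_{d,\delta} + \delta .
\end{align*}
```

Under these assumptions, the last step splits the expectation into typical paths (deviation at most $\epsilon_{d,\delta}$) and atypical paths (total probability at most $\delta$, deviation at most 1), so a smaller $\epsilon_{d,\delta}$ directly tightens the gap between $\hat{p}_d$ and $p^{\text{true}}_d$.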

Relation to $p^{\text{true}}_d$ · 2025

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

cs.CL · 2025-12-10 · unverdicted · novelty 7.0

d-TreeRPO uses tree rollouts for fine-grained verifiable rewards and time-scheduled self-distillation to reduce probability estimation gaps in diffusion LLMs, delivering substantial gains on Sudoku, Countdown, GSM8K, and Math500 benchmarks.
