VRPO provides theoretical bounds and unbiased strategies to lower variance in preference optimization gradients for masked diffusion models, producing LLaDA 1.5 with benchmark improvements of +4.7 on GSM8K, +3.0 on HumanEval, and +4.3 on Arena-Hard.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2025 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers
LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models