RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
Lora: Low-rank adaptation of large language models
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CL 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
Reweighting the training loss to emphasize semantically salient tokens lets ophthalmological report generation models reach similar quality with up to ten times less data.
citing papers explorer
-
Relative Score Policy Optimization for Diffusion Language Models
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
-
Weighting What Matters: Boosting Sample Efficiency in Medical Report Generation via Token Reweighting
Reweighting the training loss to emphasize semantically salient tokens lets ophthalmological report generation models reach similar quality with up to ten times less data.