arXiv preprint arXiv:2505.22271 (2025)
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Parameter-level defenses for model merging are vulnerable to Anchor-Guided Attack because protected weights are dominated by the pretrained model, and a new defense ARF is introduced to counter it.
Citing papers explorer
- Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.
- On the Vulnerability of Parameter-Level Defenses to Model Merging
Parameter-level defenses for model merging are vulnerable to Anchor-Guided Attack because protected weights are dominated by the pretrained model, and a new defense ARF is introduced to counter it.