CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.
arXiv preprint arXiv:2502.14361
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning
CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.