CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

Congmin Zheng; Haoxuan Li; Jiachen Zhu; Jianghao Lin; Mengyue Yang; Weinan Zhang; Weiwen Liu; Xinyi Dai; Yong Yu

arxiv: 2507.15698 · v2 · pith:5XXVNUTTnew · submitted 2025-07-21 · 💻 cs.CL · cs.AI · cs.LG

classification 💻 cs.CL cs.AI cs.LG

keywords bias · length · reasoning · cold · reward · models · counterfactually-guided · debiasing

Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD improves step-selection accuracy and encourages more concise, logically valid reasoning. Furthermore, by mitigating length bias, it consistently improves downstream RL performance and generalizes across domains.
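
As a rough illustration of the first component only (the explicit length-penalty adjustment), the sketch below subtracts a penalty proportional to step length from a raw PRM score before choosing among candidate steps. The abstract does not give the exact formula, so the linear form, the whitespace token count, the coefficient `alpha`, and the function names `length_penalized_score` and `select_step` are all assumptions made for illustration, not the authors' implementation. The learned bias estimator and the joint training strategy are not shown.

```python
from typing import List


def length_penalized_score(raw_score: float, num_tokens: int, alpha: float = 0.01) -> float:
    """Return the raw PRM score minus a linear length penalty.

    Assumption: the penalty is alpha * num_tokens; the paper may use a different form.
    """
    return raw_score - alpha * num_tokens


def select_step(candidates: List[str], raw_scores: List[float], alpha: float = 0.01) -> str:
    """Pick the candidate step with the highest length-penalized score.

    Assumption: step length is approximated by a whitespace token count.
    """
    if len(candidates) != len(raw_scores) or not candidates:
        raise ValueError("candidates and raw_scores must be non-empty and equal in length")
    adjusted = [
        length_penalized_score(score, len(step.split()), alpha)
        for step, score in zip(candidates, raw_scores)
    ]
    best_index = max(range(len(candidates)), key=adjusted.__getitem__)
    return candidates[best_index]


if __name__ == "__main__":
    # Two steps with the same conclusion; the hypothetical raw scores favor the longer one.
    steps = [
        "x = 2",
        "we note carefully that after some thought x = 2 indeed holds",
    ]
    hypothetical_scores = [0.80, 0.82]
    print(select_step(steps, hypothetical_scores, alpha=0.0))
    print(select_step(steps, hypothetical_scores, alpha=0.01))
```

With `alpha=0.0` the selection follows the raw scores; with a positive `alpha` the small score advantage of the longer step no longer outweighs its extra length. In practice `alpha` would need tuning so that genuinely more informative long steps are not penalized out of contention.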

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
cs.AI · 2025-10 · unverdicted · novelty 7.0

ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG · 2026-04 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...