Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.
arXiv preprint arXiv:2409.17407
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 2verdicts
UNVERDICTED 2representative citing papers
CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.
citing papers explorer
-
Debiasing Reward Models via Causally Motivated Inference-Time Intervention
Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.
-
CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning
CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.