CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

Congmin Zheng; Haoxuan Li; Jiachen Zhu; Jianghao Lin; Mengyue Yang; Weinan Zhang; Weiwen Liu; Xinyi Dai; Yong Yu

arXiv: 2507.15698 · v2 · pith:5XXVNUTTnew · submitted 2025-07-21 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-21 23:21 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: length bias · process reward models · counterfactual debiasing · mathematical reasoning · causal graph analysis · reinforcement learning · step selection

The pith

CoLD reduces length bias in process reward models so they favor concise, logically valid math reasoning steps over verbose ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Process reward models that guide large language models through multi-step math problems tend to give higher scores to longer reasoning steps even when the meaning and logic stay the same. This bias produces overly long outputs and weaker performance in training. CoLD counters the problem with three parts: an explicit length penalty, a separate estimator that learns to detect the spurious length signal, and joint training that forces the reward scores to ignore length. The method rests on counterfactual reasoning and a causal graph that separates length from content. If the approach holds, reward models become more reliable guides for step-by-step reasoning without rewarding unnecessary words.

Core claim

Process reward models exhibit a pervasive length bias in which longer reasoning steps receive higher scores even when semantic content and logical validity remain unchanged. CoLD mitigates this bias through an explicit length-penalty adjustment, a learned bias estimator trained on spurious length signals, and a joint training strategy that enforces length-invariance in reward predictions. The framework is grounded in counterfactual reasoning and causal graph analysis. Experiments on MATH500 and GSM-Plus show gains in step-selection accuracy, more concise yet valid reasoning chains, and consistent improvements in downstream reinforcement learning with cross-domain generalization.

What carries the argument

The CoLD framework that combines explicit length-penalty adjustment, a learned bias estimator for spurious signals, and joint training to enforce length-invariant reward predictions via counterfactual guidance and causal graph analysis.
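
A minimal sketch of the explicit length-penalty component alone, assuming the form r*(x) = r(x) − αℓ(x) that this page quotes from the paper. The `Step` class, the whitespace-token length measure, the raw scores, and the value of `alpha` are illustrative assumptions, not the authors' implementation or data.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    raw_reward: float  # score from a hypothetical PRM (made-up value)


def step_length(text: str) -> int:
    # Assumption: whitespace tokens stand in for the paper's length measure.
    return len(text.split())


def debiased_reward(step: Step, alpha: float) -> float:
    # r*(x) = r(x) - alpha * l(x)
    return step.raw_reward - alpha * step_length(step.text)


def select_step(candidates: list[Step], alpha: float) -> Step:
    # Pick the candidate with the highest length-adjusted reward.
    return max(candidates, key=lambda s: debiased_reward(s, alpha))


concise = Step("x = 4", 0.80)
verbose = Step("so after thinking about it carefully we can see that x = 4", 0.85)

raw_choice = select_step([concise, verbose], alpha=0.0)
adjusted_choice = select_step([concise, verbose], alpha=0.01)
```

With `alpha = 0.0` the longer step wins on raw score; with the small positive `alpha` used here the concise step is selected instead. The learned bias estimator and the joint training objective are not modeled in this sketch.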

If this is right

Higher accuracy when selecting the correct next reasoning step during inference.
Production of shorter yet logically valid solution chains on benchmarks such as MATH500 and GSM-Plus.
Measurable gains in final answer accuracy after reinforcement learning that uses the debiased rewards.
Improved performance when the same model is applied to reasoning tasks outside the original training domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same counterfactual separation technique could be tested on other spurious preferences in reward models, such as format or verbosity unrelated to correctness.
Reducing length bias may lower the token cost of inference by discouraging unnecessarily long chains without extra prompting.
Causal-graph analysis of reward models might reveal additional hidden confounders beyond length that affect multi-step reasoning quality.

Load-bearing premise

Length bias is a separable spurious signal that can be removed through counterfactual adjustments and joint training without lowering the model's accuracy at recognizing genuinely valid reasoning steps.

What would settle it

On held-out math problems, apply CoLD-trained models and check whether reward scores still rise reliably with step length even when logical validity is held constant; if the correlation remains strong or concise solutions do not improve, the debiasing claim fails.
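
One way to run such a check, sketched under assumptions: score each held-out step and a lengthened variant with the same validity, then look at the mean reward gap across pairs and at the correlation between length and reward. The helper names and every number below are illustrative placeholders, not results from the paper.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    # Plain Pearson correlation; returns 0.0 if either series is constant.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mean_length_gap(pairs: list[tuple[float, float]]) -> float:
    # pairs: (reward of original step, reward of lengthened variant),
    # where the variant is assumed to keep the same logic and validity.
    return sum(r_long - r_orig for r_orig, r_long in pairs) / len(pairs)


# Placeholder data: step lengths and rewards from two hypothetical models.
lengths = [5, 10, 15, 20]
biased_rewards = [0.5, 0.6, 0.7, 0.8]      # reward rises with length
invariant_rewards = [0.6, 0.7, 0.7, 0.6]   # no linear trend with length

corr_biased = pearson(lengths, biased_rewards)
corr_invariant = pearson(lengths, invariant_rewards)
gap = mean_length_gap([(0.5, 0.7), (0.6, 0.7)])
```

A correlation near 1 or a clearly positive mean gap on validity-matched pairs would indicate that length bias remains; values near zero would be consistent with the debiasing claim, though they would not by themselves show that accuracy is preserved.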

Figures

Figures reproduced from arXiv: 2507.15698 by Congmin Zheng, Haoxuan Li, Jiachen Zhu, Jianghao Lin, Mengyue Yang, Weinan Zhang, Weiwen Liu, Xinyi Dai, Yong Yu.

**Figure 2.** Causal Graph for illustrating the influence of vari...

**Figure 3.** Overview of the Counterfactually-Guided Length Debiasing (CoLD) framework for Process Reward Models, in...

**Figure 4.** The performance of CoLD PRM under varying values of...

**Figure 5.** The joint distribution of rewards and step lengths

**Figure 6.** An example of the original and extended solutions

Original abstract

Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD improves accuracy in step selection and encourages more concise, logically valid reasoning. Furthermore, it consistently improves downstream RL performance and generalizes across domains by mitigating length bias, demonstrating CoLD's strong generalization capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoLD gives a practical unified framework for length debiasing in PRMs via counterfactuals and causal grounding, with reported gains on MATH500 and GSM-Plus, but the counterfactual construction is the part that needs the most checking.

Hi, the main thing to know is that this paper puts forward CoLD as a way to stop process reward models from favoring longer reasoning steps even when the logic stays the same. It does this with three pieces: an explicit length penalty, a separate bias estimator trained on spurious length signals, and joint training that pushes the reward to be length-invariant. The approach draws on counterfactual pairs and a causal graph to justify the separation of length from validity. Experiments on MATH500 and GSM-Plus show better step selection accuracy, shorter but still correct reasoning traces, and downstream RL improvements that hold across domains. Using public benchmarks helps with checking the claims. The causal grounding and the joint objective feel like a step past simple post-hoc penalties or single-component fixes in earlier PRM work.

The soft spot sits in the counterfactual generation step. The method has to produce variants where only token count changes while semantics and logical validity stay fixed. If the editing or prompting used to lengthen or shorten steps also shifts the reasoning structure, swaps theorems, or adds small errors, then the bias estimator will learn to suppress real quality signals instead of length alone. That would weaken the invariance guarantee and could hurt step accuracy in ways the current results do not fully rule out. I would want to see concrete before-and-after examples of the pairs plus checks that validity is preserved.

This work is aimed at people building or tuning reward models for LLM reasoning and alignment. Anyone dealing with verbose outputs or unreliable step scoring will get something useful from the framework and the reported gains. It has enough new structure and empirical backing to go to peer review rather than a desk reject, though referees will likely focus on the counterfactual quality and the ablation controls.

Referee Report

2 major / 1 minor

Summary. The paper identifies a length bias in Process Reward Models (PRMs) for mathematical reasoning, where longer steps receive higher scores despite unchanged semantics and validity. It proposes CoLD, a framework using counterfactual reasoning, causal graph analysis, an explicit length penalty, a learned bias estimator, and joint training (Eq. 4) to enforce length-invariance. Experiments on MATH500 and GSM-Plus report improved step-selection accuracy, more concise valid reasoning, and gains in downstream RL with cross-domain generalization.

Significance. If the counterfactual pairs successfully isolate length from validity and the empirical gains hold under proper controls, CoLD could improve PRM reliability for guiding LLM reasoning, reducing verbosity while preserving logical correctness, and enhancing RL outcomes in math domains.

major comments (2)

[§3.2] The counterfactual generation procedure (via editing or prompting) must be shown to preserve semantic content, logical validity, and applicable theorems exactly while only varying token count; if it inadvertently alters reasoning structure or introduces subtle errors, the bias estimator trained in the joint objective will suppress valid features rather than pure length signals, violating the length-invariance guarantee.
[Experiments] The reported accuracy improvements and RL gains on MATH500 and GSM-Plus lack explicit details on baselines, statistical tests, ablation controls isolating the bias estimator, or exact metrics; without these, it is unclear whether the gains stem from successful length debiasing or from other factors.
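
The preservation check raised in the first comment could be screened automatically before manual review. The sketch below assumes each counterfactual pair has already been embedded by some sentence encoder (not specified here) and flags pairs whose cosine similarity falls under a threshold. The function names, the threshold, and the toy vectors are hypothetical; cosine similarity can only flag semantic drift and cannot confirm logical validity.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    # Cosine similarity between two embedding vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_u * norm_v)


def flag_suspect_pairs(
    pairs: list[tuple[list[float], list[float]]], threshold: float = 0.9
) -> list[int]:
    # Return indices of (original, variant) pairs that drifted semantically.
    # The 0.9 threshold is an arbitrary placeholder, not a recommended value.
    return [i for i, (u, v) in enumerate(pairs) if cosine(u, v) < threshold]


# Toy embeddings: pair 0 is nearly identical, pair 1 has drifted.
toy_pairs = [
    ([1.0, 0.0, 0.0], [0.99, 0.05, 0.0]),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
]
suspects = flag_suspect_pairs(toy_pairs)
```

Flagged pairs would then go to manual validity annotation rather than being silently dropped, since removing them could itself bias the training set.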

minor comments (1)

[Abstract] Provide at least one concrete metric (e.g., accuracy delta or RL reward improvement) to support the claims of improved step selection and RL performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and indicate the revisions we plan to incorporate.

Referee: [§3.2] The counterfactual generation procedure (via editing or prompting) must be shown to preserve semantic content, logical validity, and applicable theorems exactly while only varying token count; if it inadvertently alters reasoning structure or introduces subtle errors, the bias estimator trained in the joint objective will suppress valid features rather than pure length signals, violating the length-invariance guarantee.

Authors: We agree that explicit verification of semantic and logical preservation in the counterfactual pairs is essential to support the length-invariance claim. In the revised manuscript, we will expand Section 3.2 to include a new analysis subsection with both qualitative examples and quantitative checks (e.g., semantic similarity via embedding cosine scores and manual validity annotations on a held-out sample of pairs). These additions will demonstrate that edits modify only token count while retaining reasoning structure, theorems, and correctness.

Revision: yes

Referee: [Experiments] The reported accuracy improvements and RL gains on MATH500 and GSM-Plus lack explicit details on baselines, statistical tests, ablation controls isolating the bias estimator, or exact metrics; without these, it is unclear whether the gains stem from successful length debiasing or from other factors.

Authors: We acknowledge that greater transparency in experimental reporting is needed. In the revised Experiments section, we will add: (i) a complete enumeration of baselines with citations, (ii) statistical significance tests (paired t-tests with p-values) for all reported improvements, (iii) targeted ablations that isolate the bias estimator component, and (iv) expanded tables containing exact metric definitions and numerical values. These changes will clarify the source of the observed gains.

Revision: yes
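
For reference, the paired t-test mentioned in the second response compares two systems on the same set of items or runs. The sketch below computes only the t statistic and degrees of freedom using the standard library; the p-value would come from a t-distribution table or a statistics package. The function name and the sample values are placeholders, not numbers from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(baseline: list[float], treated: list[float]) -> tuple[float, int]:
    # Paired t statistic: mean difference divided by its standard error.
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    diffs = [t - b for b, t in zip(baseline, treated)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Placeholder scores for a baseline PRM and a debiased PRM on the same 4 splits.
baseline_scores = [1.0, 2.0, 3.0, 4.0]
debiased_scores = [2.0, 4.0, 3.0, 7.0]
t_stat, dof = paired_t_statistic(baseline_scores, debiased_scores)
```

Pairing matters here because both models are evaluated on identical problems, so per-problem difficulty cancels out in the differences.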

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper presents CoLD as a framework with three explicitly defined components—an explicit length-penalty adjustment, a learned bias estimator, and a joint training objective—grounded in counterfactual reasoning and causal graph analysis. These are introduced independently in the method section and validated through experiments on external public benchmarks (MATH500, GSM-Plus) and downstream RL tasks. No equation or claim reduces by construction to a fitted parameter renamed as a prediction, nor does any load-bearing premise collapse to a self-citation chain or self-referential definition. The central length-invariance guarantee is supported by the counterfactual construction and training procedure rather than being presupposed by the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that length can be treated as an independent spurious factor; the learned bias estimator likely introduces fitted parameters whose exact count and values are not specified in the abstract.

axioms (1)

domain assumption: Length bias is a spurious correlation independent of logical validity and semantic content.
Invoked in the identification of pervasive bias and in the design of counterfactual adjustments and joint training for length-invariance.

pith-pipeline@v0.9.0 · 5756 in / 1282 out tokens · 47071 ms · 2026-05-21T23:21:37.354166+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We adopt a counterfactual formulation of debiasing: for any input S, we define the length bias as the change in prediction under a hypothetical intervention on L: $\mathrm{Bias}(S) := r(S) - r(S_{L \leftarrow \tilde{L}})$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

$r^{*}(x) = r(x) - \alpha\,\ell(x)$
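
The two quoted passages can be related by a short worked example. This is a sketch under a simplifying assumption that is not taken from the paper: the raw reward splits additively into a content term $c(S)$ and a linear length term with slope $\beta$, and the penalty's length measure $\ell$ equals $L$.

```latex
\begin{align*}
r(S) &= c(S) + \beta\,L(S)
  && \text{(assumed additive form; } c \text{ depends on content only)} \\
\mathrm{Bias}(S) &= r(S) - r(S_{L \leftarrow \tilde{L}})
  = \beta\,\bigl(L(S) - \tilde{L}\bigr)
  && \text{(the intervention leaves } c(S) \text{ unchanged)} \\
r^{*}(S) &= r(S) - \alpha\,\ell(S)
  = c(S) + (\beta - \alpha)\,L(S)
  && \text{(taking } \ell(S) = L(S)\text{)}
\end{align*}
```

Under this toy model the adjusted reward is length-invariant exactly when $\alpha = \beta$. If the real dependence on length is nonlinear or interacts with content, a single linear penalty cannot cancel it, which is one plausible reason for pairing the penalty with a learned bias estimator.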

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
cs.AI 2025-10 unverdicted novelty 7.0

ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG 2026-04 unverdicted novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 2 Pith papers · 12 internal anchors

[1] Qwen technical report. arXiv:2309.16609.
[2] Chen, L.; Zhu, C.; Soselia, D.; Chen, J.; Zhou, T.; Goldstein, T.; Huang, H.; Shoeybi, M.; and Catanzaro, B. ODIN: Disentangled reward mitigates hacking in RLHF. arXiv:2402.07319.
[3] Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 herd of models. arXiv:2407.21783.
[4] Eisenstein, J.; Nagpal, C.; Agarwal, A.; Beirami, A.; D'Amour, A.; Dvijotham, D.; Fisch, A.; Heller, K.; Pfohl, S.; Ramachandran, D.; et al. Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv:2312.09244.
[5] Gao, B.; Cai, Z.; Xu, R.; Wang, P.; Zheng, C.; Lin, R.; Lu, K.; Liu, D.; Zhou, C.; Xiao, W.; Hu, J.; Liu, T.; and Chang, B. LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback. arXiv:2406.14024.
[6] Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; and Steinhardt, J. Measuring Mathematical Problem Solving With the MATH Dataset. URL https://arxiv.org/abs/2103.03874.
[7] Huang, Z.; Qiu, Z.; Wang, Z.; Ponti, E. M.; and Titov, I. Post-hoc reward calibration: A case study on length bias. arXiv:2409.17407.
[8] Li, Q.; Cui, L.; Zhao, X.; Kong, L.; and Bi, W. GSM-Plus: A comprehensive benchmark for evaluating the robustness of LLMs as mathematical problem solvers. arXiv:2402.19255.
[9] Li, W.; and Li, Y. Process reward model with Q-value rankings. arXiv:2410.11287.
[10] Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; and Cobbe, K. Let's verify step by step. arXiv:2305.20050.
[11] Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. DeepSeek-V3 technical report. arXiv:2412.19437.
[12] Luo, L.; Liu, Y.; Liu, R.; Phatale, S.; Guo, M.; Lara, H.; Li, Y.; Shu, L.; Zhu, Y.; Meng, L.; Sun, J.; and Rastogi, A. Improve Mathematical Reasoning in Language Models by Automated Process Supervision. arXiv:2406.06592.
[13] McAleese, N.; Pokorny, R. M.; Uribe, J. F. C.; Nitishinskaya, E.; Trebacz, M.; and Leike, J. LLM Critics Help Catch LLM Bugs. arXiv:2407.00215.
[14] GPT-4 technical report. arXiv:2303.08774.
[15] arXiv:2410.18982.
[16] Ramé, A.; Ferret, J.; Vieillard, N.; Dadashi, R.; Hussenot, L.; Cedoz, P.-L.; Sessa, P. G.; Girgin, S.; Douillard, A.; and Bachem, O. WARP: On the benefits of weight averaged rewarded policies. arXiv:2406.16768.
[17] Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.; Wu, Y.; et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
[18] Shen, W.; Zheng, R.; Zhan, W.; Zhao, J.; Dou, S.; Gui, T.; Zhang, Q.; and Huang, X. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. arXiv:2310.05199.
[19] Singhal, P.; Goyal, T.; Xu, J.; and Durrett, G. A long way to go: Investigating length correlations in RLHF. arXiv:2310.03716.
[20] Snell, C.; Lee, J.; Xu, K.; and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv:2408.03314.
[21] Sun, Z.; Yu, L.; Shen, Y.; Liu, W.; Yang, Y.; Welleck, S.; and Gan, C. Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision. arXiv:2403.09472. Wang, J.; Fang, M.; Wan, Z.; Wen, M.; Zhu, J.; Liu, A.; Gong, Z.; Song, Y.; Chen, L.; Ni, L. M.; Yang, L.; Wen, Y.; and Zhang, W. 2024a. OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models. arXiv:2410.09671. Wang, P.; Li, L.; Sha...
[22] Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv:2408.00724.
[23] Xia, S.; Li, X.; Liu, Y.; Wu, T.; and Liu, P. Evaluating Mathematical Reasoning Beyond Accuracy. arXiv:2404.05692.
[24] Qwen3 technical report. arXiv:2505.09388.
[25] Zhang, L.; Hosseini, A.; Bansal, H.; Kazemi, M.; Kumar, A.; and Agarwal, R. Generative Verifiers: Reward Modeling as Next-Token Prediction. arXiv:2408.15240.
[26] Zhao, J.; Liu, R.; Zhang, K.; Zhou, Z.; Gao, J.; Li, D.; Lyu, J.; Qian, Z.; Qi, B.; Li, X.; et al. GenPRM: Scaling test-time compute of process reward models via generative reasoning. arXiv:2504.00891.
[27] Zheng, C.; Zhang, Z.; Zhang, B.; Lin, R.; Lu, K.; Yu, B.; Liu, D.; Zhou, J.; and Lin, J. ProcessBench: Identifying process errors in mathematical reasoning. arXiv:2412.06559.
[28] Zhu, J.; Zheng, C.; Lin, J.; Du, K.; Wen, Y.; Yu, Y.; Wang, J.; and Zhang, W. Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning. arXiv:2502.14361.
[29] Zhu, Q.; Guo, D.; Shao, Z.; Yang, D.; Wang, P.; Xu, R.; Wu, Y.; Li, Y.; Gao, H.; Ma, S.; et al. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence. arXiv:2406.11931.

Experiment Details: Example of Semi-synthetic Solution. We generate extended variants either by duplicating the original step or by prompting DeepSeek (Liu et al...
