rePIRL: Learn PRM with Inverse RL for LLM Reasoning

Kaijie Zhu; Lun Wang; Wenbo Guo; Xian Wu; Ying Zhang

arxiv: 2602.07832 · v2 · pith:GNOCPX6Bnew · submitted 2026-02-08 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 13:09 UTC · model grok-4.3

keywords: process reward model · inverse reinforcement learning · LLM reasoning · PRM learning · dual learning process · online offline unification · test-time scaling

The pith

rePIRL learns process reward models for LLM reasoning via inverse RL with minimal expert policy assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces rePIRL, an inverse reinforcement learning framework designed to train process reward models that supervise individual steps in large language model reasoning. Existing methods typically require detailed knowledge of expert policies such as their reward functions or suffer from training problems like entropy collapse that produce weak models. rePIRL instead runs a dual learning loop that alternates updates between the policy and the reward model, using specialized scaling techniques suited to language models. The authors prove that this single framework unifies both online and offline PRM training approaches. Experiments on standard math and coding benchmarks show improved performance, with further uses demonstrated in test-time training and scaling.

Core claim

rePIRL is an inverse RL framework that learns PRMs for LLM reasoning through a dual learning process that alternates updates between the policy and the PRM. Customized techniques address the scaling challenges of applying inverse RL to large language models, including the avoidance of entropy collapse. The framework theoretically unifies online and offline PRM learning methods, enabling effective training under minimal assumptions about expert policies rather than requiring their reward functions. The claim is supported by empirical gains on math and coding reasoning datasets, together with applications to test-time training, test-time scaling, and early training signals for hard problems.

What carries the argument

The argument rests on the dual learning process, which alternates updates between the policy and the PRM and is equipped with customized techniques to scale inverse RL to LLMs without entropy collapse.
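To make the idea of an alternating policy/reward loop concrete, here is a minimal toy sketch on a single-step problem with three discrete actions. It is not the paper's algorithm, whose objective is not shown on this page. As a stand-in it uses the textbook maximum-entropy inverse RL reward gradient (expert visitation minus policy visitation) and a softmax policy. The names `alternate_updates` and `expert_probs`, the learning rates, and the tabular setup are all assumptions made for illustration, and none of the LLM-specific scaling techniques are modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alternate_updates(expert_probs, rounds=200, lr_reward=0.5, lr_policy=0.5):
    """Toy alternating loop (illustrative assumption, not the paper's method).

    Reward step: raise the reward where the expert acts more often than the
    current policy, and lower it where the policy acts more often.
    Policy step: move the policy logits part of the way toward the reward.
    """
    n = len(expert_probs)
    reward = [0.0] * n          # tabular stand-in for a process reward model
    policy_logits = [0.0] * n   # tabular stand-in for the LLM policy
    for _ in range(rounds):
        policy = softmax(policy_logits)
        # Reward update: expert visitation minus policy visitation.
        reward = [r + lr_reward * (e - p)
                  for r, e, p in zip(reward, expert_probs, policy)]
        # Policy update: interpolate logits toward the updated reward.
        policy_logits = [(1.0 - lr_policy) * logit + lr_policy * r
                         for logit, r in zip(policy_logits, reward)]
    return reward, softmax(policy_logits)


# Usage: the learned policy should end up close to the expert distribution.
reward, policy = alternate_updates([0.7, 0.2, 0.1])
print([round(p, 3) for p in policy])
```

In this toy, the loop stops moving only when the policy matches the expert and the policy logits equal the reward, so the learned reward ranks actions the way the expert does. Whether rePIRL's updates share this fixed-point structure is exactly what the referee report asks the authors to spell out.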

If this is right

PRMs can be trained without access to expert reward functions or other strong policy details.
Online and offline PRM learning methods become unified inside one theoretical framework.
The resulting PRM improves performance when applied to test-time training and test-time scaling.
Early signals from the PRM can identify and prioritize training on hard reasoning problems.
Better results are obtained on standardized math and coding reasoning datasets than prior methods.
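One common way a PRM is used for test-time scaling is best-of-N selection: sample several reasoning chains, score each step, and keep the chain with the best aggregate score. The sketch below illustrates that pattern only. The `score_step(prefix, step)` interface, the minimum-over-steps aggregation, and the `toy_score_step` placeholder are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def aggregate_min(step_scores: Sequence[float]) -> float:
    """Score a chain by its weakest step (one common aggregation choice)."""
    return min(step_scores)


def best_of_n(
    candidates: List[List[str]],
    score_step: Callable[[List[str], str], float],
    aggregate: Callable[[Sequence[float]], float] = aggregate_min,
) -> int:
    """Return the index of the candidate chain with the highest aggregate score.

    `score_step(prefix, step)` is a hypothetical PRM interface: it receives the
    steps so far plus the next step and returns a scalar score.
    """
    if not candidates:
        raise ValueError("need at least one candidate chain")
    best_index, best_score = 0, float("-inf")
    for index, steps in enumerate(candidates):
        scores = [score_step(steps[:i], step) for i, step in enumerate(steps)]
        # An empty chain can never win against a scored one.
        total = aggregate(scores) if scores else float("-inf")
        if total > best_score:
            best_index, best_score = index, total
    return best_index


def toy_score_step(prefix: List[str], step: str) -> float:
    """Placeholder scorer for the demo only: penalizes steps marked 'guess'."""
    return 0.1 if "guess" in step else 0.9


chains = [
    ["expand the square", "guess the root"],
    ["expand the square", "apply the quadratic formula"],
]
print(best_of_n(chains, toy_score_step))  # selects index 1, the chain without a guess
```

Taking the minimum treats a chain as only as good as its weakest step. A product or mean over step scores is an equally plausible choice and can be passed in through `aggregate`.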

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The minimal-assumption design may extend usefully to domains with noisy or incomplete expert traces, such as real-world user interaction data.
Similar dual-update loops could be tested on sequential tasks outside language, for instance in automated planning or strategy learning.
The unification result suggests hybrid online-offline training schedules as a practical next step for other reward-modeling settings.
One could measure whether the same recipe reduces reward hacking when the PRM is inserted into broader LLM alignment pipelines.

Load-bearing premise

The dual learning process with customized techniques for scaling inverse RL to LLMs avoids entropy collapse and other limitations without needing strong assumptions such as access to expert reward functions.
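Entropy collapse can be checked empirically by tracking the entropy of the policy's next-token distributions over training. The sketch below shows one simple monitor. The `floor` and `patience` thresholds and the function names are assumptions for illustration, and the paper's own diagnostics may differ.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in nats, of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average entropy over the sampled positions of one training batch."""
    if not distributions:
        raise ValueError("need at least one distribution")
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def collapsed(entropy_history: List[float], floor: float = 0.05,
              patience: int = 3) -> bool:
    """Flag collapse when the last `patience` readings all fall below `floor`.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if len(entropy_history) < patience:
        return False
    return all(value < floor for value in entropy_history[-patience:])


# Usage: a history that decays toward zero trips the monitor.
history = [0.60, 0.04, 0.03, 0.01]
print(collapsed(history))
```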

What would settle it

The claim would be undermined if a PRM trained with rePIRL on a math reasoning dataset such as GSM8K, when used to guide LLM inference, produced no gain in step accuracy or final-answer rate over standard supervised baselines.
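Deciding whether a measured difference counts as "no gain" requires some notion of uncertainty. One hedged option, not proposed in the paper, is a paired bootstrap over per-problem correctness flags for the guided system and the baseline. The function name, the percentile interval, and the demo flags below are made up for illustration.

```python
import random
from typing import List, Tuple


def paired_bootstrap_gain(
    guided: List[bool],
    baseline: List[bool],
    resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed gain, lower bound, upper bound) in accuracy.

    Each list holds one correctness flag per problem, in the same order.
    The bounds are an approximate 95% percentile interval.
    """
    if len(guided) != len(baseline) or not guided:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(guided)
    # Per-problem difference: +1 guided wins, -1 baseline wins, 0 tie.
    diffs = [int(g) - int(b) for g, b in zip(guided, baseline)]
    observed = sum(diffs) / n
    gains = []
    for _ in range(resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        gains.append(sum(sample) / n)
    gains.sort()
    low = gains[int(0.025 * resamples)]
    high = gains[int(0.975 * resamples) - 1]
    return observed, low, high


# Usage with made-up flags (not results from the paper).
guided_flags = [True, True, False, True, True, False, True, True]
baseline_flags = [True, False, False, True, False, False, True, True]
print(paired_bootstrap_gain(guided_flags, baseline_flags))
```

If the interval comfortably includes zero, the comparison gives no evidence of a gain. Pairing by problem matters because both systems are evaluated on the same items.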

Figures

Figures reproduced from arXiv: 2602.07832 by Kaijie Zhu, Lun Wang, Wenbo Guo, Xian Wu, Ying Zhang.

**Figure 1.** Performance of three applications of our PRM (Section 4.3) and rePIRL without outcome reward (Section 4.4). Accompanying text: "… prolonged period, indicating that outcome rewards provide limited useful feedback early on. In contrast, training with our PRM (rePIRL) yields measurable improvements early on, converges substantially faster, and achieves higher accuracy on hard problems. This highlights the utility of PRM when outc…"

**Figure 2.** Comparison of rePIRL using Claude-3.7-Sonnet versus DeepSeek-R1 as expert trajectory generators. [Plot: x-axis benchmarks MATH-500, AIME-2024, Minerva Math, AMC, Olympiadbench, Avg; y-axis Accuracy; legend rePIRL (w/ IS), w/o IS (3 epochs), w/o IS (5 epochs).]

Original abstract

Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRM) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that updates the policy and the PRM interchangeably. Our learning algorithm has customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework can unify both online and offline PRM learning methods, justifying that rePIRL can learn PRMs with minimal assumptions. Empirical evaluations on standardized math and coding reasoning datasets demonstrate the effectiveness of rePIRL over existing methods. We further show the application of our trained PRM in test-time training, test-time scaling, and providing an early signal for training hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

rePIRL frames PRM learning as inverse RL with dual policy-reward updates that claim to unify online and offline regimes under minimal expert assumptions, but the unification's exactness is the part that needs checking.

The main point of this paper is that it sets up a dual learning loop in which the policy and the process reward model update each other in an inverse RL style. The authors argue that this recovers both online and offline PRM methods without needing strong expert reward functions or running into entropy collapse right away. They add LLM-specific scaling techniques to make inverse RL workable at this size, then test on standard math and coding reasoning sets. They also show the trained PRM helping with test-time scaling and early signals on hard problems, and they ablate the design choices.

That combination of a unification claim plus downstream uses is what stands out as new relative to prior PRM work, which either leaned hard on expert access or hit stability walls. The empirical side looks straightforward, and the ablation helps pin down which pieces matter. The unification is the part that could matter if it really holds under light assumptions.

On the soft spots: the theoretical reduction steps are not laid out in enough detail in the abstract to confirm whether the dual updates truly avoid reintroducing regularization terms that would limit the policy class or recreate the entropy issues they want to sidestep. The stress-test note on this is worth keeping in mind until the derivations are walked through. Empirically, the gains are reported, but without error bars or fuller baseline implementation notes it is hard to separate method from tuning.

This is for people already working on process supervision for LLM reasoning chains who want a framework that tries to relax expert requirements. A reader focused on reward modeling or test-time compute would get concrete ideas from the applications and the recipe. It is worth sending to peer review so the theory can be verified and the experiments stress-tested on the details.

Referee Report

2 major / 2 minor

Summary. The paper introduces rePIRL, an inverse-RL-inspired framework for learning Process Reward Models (PRMs) to improve LLM reasoning. It proposes a dual learning process that alternately updates the policy and the PRM, together with customized scaling techniques for LLMs. The central theoretical claim is that this framework unifies online and offline PRM learning methods while requiring only minimal assumptions on expert policies. Empirically, the method is reported to outperform prior approaches on standardized math and coding reasoning benchmarks and is shown to be useful for test-time training, test-time scaling, and early detection of hard problems, with supporting ablation studies.

Significance. If the unification result is rigorously derived and the empirical gains prove robust, the work would supply a principled route to PRM learning that avoids both strong expert-reward assumptions and entropy-collapse pathologies. The unification of online and offline regimes under a single dual-update scheme, together with the demonstrated downstream uses in test-time computation, would constitute a substantive contribution to the literature on reward modeling for LLM reasoning.

major comments (2)

[Theoretical Analysis] Theoretical unification section: the claim that the dual process recovers both online and offline PRM objectives as special cases must be supported by explicit reduction steps. It remains unclear whether the customized regularizer or the LLM-specific parameterization re-introduces entropy-regularization assumptions that the abstract asserts are avoided.
[Experiments] Experimental results: superiority is asserted on math and coding datasets, yet the absence of reported standard deviations across multiple seeds, full ablation tables, and precise hyper-parameter settings for the dual updates makes it impossible to verify that the gains are not attributable to post-hoc fitting or implementation details.

minor comments (2)

[Abstract] The abstract refers to 'customized techniques' without naming them; a one-sentence enumeration would improve readability.
[Method] Notation for the policy-PRM interchange in the dual update could be accompanied by a compact algorithmic box or diagram to reduce ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions we will make.

Referee: [Theoretical Analysis] Theoretical unification section: the claim that the dual process recovers both online and offline PRM objectives as special cases must be supported by explicit reduction steps. It remains unclear whether the customized regularizer or the LLM-specific parameterization re-introduces entropy-regularization assumptions that the abstract asserts are avoided.

Authors: We agree that explicit reduction steps would strengthen the presentation. In the revised manuscript we will insert detailed derivations showing how the dual update recovers the online objective when the policy is updated first and the offline objective when the PRM is updated first, under the minimal assumptions stated in the paper. The customized regularizer is introduced only for numerical stability during LLM-scale optimization and does not encode entropy regularization on the expert policy; we will add a clarifying paragraph to rule out re-introduction of the assumptions we claim to avoid. revision: yes
Referee: [Experiments] Experimental results: superiority is asserted on math and coding datasets, yet the absence of reported standard deviations across multiple seeds, full ablation tables, and precise hyper-parameter settings for the dual updates makes it impossible to verify that the gains are not attributable to post-hoc fitting or implementation details.

Authors: We acknowledge that additional statistical detail is needed for full verification. The revision will report mean and standard deviation over at least three random seeds for all main results, expand the ablation study into a complete table, and move the precise hyper-parameter settings for the dual updates (including learning rates, regularization coefficients, and update frequencies) to a new appendix. revision: yes
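The seed-level reporting promised in this response amounts to a mean and a sample standard deviation per method. A minimal sketch using only the standard library follows. The function name `summarize_runs` and the placeholder scores are hypothetical and are not results from the paper.

```python
import statistics
from typing import Dict, List


def summarize_runs(scores_by_method: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute mean and sample standard deviation per method across seeds."""
    summary: Dict[str, Dict[str, float]] = {}
    for method, scores in scores_by_method.items():
        if len(scores) < 2:
            raise ValueError(f"{method}: need at least two seeds for a std")
        summary[method] = {
            "mean": statistics.mean(scores),
            "std": statistics.stdev(scores),  # sample std, divides by n - 1
            "n": len(scores),
        }
    return summary


# Placeholder accuracies for three seeds (made up, not from the paper).
runs = {"method_a": [1.0, 2.0, 3.0], "method_b": [2.0, 2.0, 2.0]}
for name, stats in summarize_runs(runs).items():
    print(f"{name}: {stats['mean']:.2f} +/- {stats['std']:.2f} (n={stats['n']})")
```

With only three seeds the standard deviation is itself a noisy estimate, so overlapping intervals should be read cautiously.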

Circularity Check

0 steps flagged

No significant circularity detected in unification claim

The paper's abstract and summary present a dual learning process for rePIRL that theoretically unifies online and offline PRM methods under minimal assumptions on expert policies. No equations, self-citations, or derivations are exhibited that reduce the central result to fitted inputs, self-definitions, or load-bearing prior work by the same authors. The framework is described with customized scaling techniques for inverse RL, and the unification is positioned as an independent theoretical justification rather than a renaming or identity-level reduction. This qualifies as a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the paper claims minimal assumptions about expert policies and introduces customized scaling techniques for inverse RL to LLMs; no specific free parameters, axioms, or invented entities are identifiable without the full text.

pith-pipeline@v0.9.0 · 5772 in / 1207 out tokens · 81986 ms · 2026-05-21T13:09:26.019087+00:00 · methodology

[1] [1]

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., and Hooker, S. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.arXiv preprint arXiv:2402.14740,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Training language models to reason efficiently.arXiv preprint arXiv:2502.04463,2025

URL https://www.anthropic.com/ news/claude-3-7-sonnet. Arora, D. and Zanette, A. Training language models to rea- son efficiently.arXiv preprint arXiv:2502.04463,

work page arXiv

[3] [3]

Process Reinforcement through Implicit Rewards

Cui, G., Yuan, L., Wang, Z., Wang, H., Li, W., He, B., Fan, Y ., Yu, T., Xu, Q., Chen, W., et al. Process re- inforcement through implicit rewards.arXiv preprint arXiv:2502.01456, 2025a. Cui, G., Zhang, Y ., Chen, J., Yuan, L., Wang, Z., Zuo, Y ., Li, H., Fan, Y ., Chen, H., Chen, W., et al. The entropy mech- anism of reinforcement learning for reasoning ...

work page internal anchor Pith review Pith/arXiv arXiv

[4] Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024a.

Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. KTO: Model alignment as prospect theoretic optimization, 2024. URL https://arxiv.org/abs/2402.01306, 2024b.

Finn, C., Chr...

[5] Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y., et al. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196.

[6] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[7] Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.

[8] He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.

[9] Heess, N., Sriram, S., Lemmon, J., Merel, J., Tassa, Y., Erez, T., Wang, Z., Eslami, S. M. A., Riedmiller, M., and Silver, D. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286.

[10] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

[11] Hu, J. REINFORCE++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262.

[12] Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.

[13] Ji, K., Liu, G., Dai, N., Yang, Q., Zheng, R., Wu, Z., Dun, C., Gu, Q., and Yan, L. Enhancing multi-step reasoning abilities of language models through direct Q-function optimization. arXiv preprint arXiv:2410.09302.

[14] Li, A., Yuan, Z., Zhang, Y., Liu, S., and Wang, Y. Know when to explore: Difficulty-aware certainty as a guide for LLM reinforcement learning. arXiv preprint arXiv:2509.00125, 2025a.

Li, D., Cao, S., Griggs, T., Liu, S., Mo, X., Tang, E., Hegde, S., Hakhamaneshi, K., Patil, S. G., Zaharia, M., et al. LLMs can easily learn to reason from demonstrations st...

[15] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[16] Luong, T. Q., Zhang, X., Jie, Z., Sun, P., Jin, X., and Li, H. ReFT: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 3,

[17] Ma, Q., Zhou, H., Liu, T., Yuan, J., Liu, P., You, Y., and Yang, H. Let's reward step by step: Step-level reward model as the navigators for reasoning. arXiv preprint arXiv:2310.10080.

[18] Ma, X., Wan, G., Yu, R., Fang, G., and Wang, X. CoT-Valve: Length-compressible chain-of-thought tuning. arXiv preprint arXiv:2502.09601.

[19] Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393.

[20] Rafailov, R., Hejna, J., Park, R., and Finn, C. From r to Q: Your language model is secretly a Q-function. arXiv preprint arXiv:2404.12358.

[21] Richemond, P. H., Tang, Y., Guo, D., Calandriello, D., Azar, M. G., Rafailov, R., Pires, B. A., Tarassov, E., Spangher, L., Ellsworth, W., et al. Offline regularised reinforcement learning for large language models alignment. arXiv preprint arXiv:2405.19107.

[22] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[23] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[24] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366.

[25] Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Garcia, X., Liu, P. J., Harrison, J., Lee, J., Xu, K., et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.

[26] Sun, H. and van der Schaar, M. Inverse-RLignment: Inverse reinforcement learning from demonstrations for LLM alignment. arXiv preprint arXiv:2405.15624.

[27] Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L., Creswell, A., Irving, G., and Higgins, I. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.

[28] Wang, H., Hao, S., Dong, H., Zhang, S., Bao, Y., Yang, Z., and Wu, Y. Offline reinforcement learning for LLM multi-step reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, 2025a.

Wang, P., Li, L., Shao, Z., Xu, R., Dai, D., Li, Y., Chen, D., Wu, Y., and Sui, Z. Math-Shepherd: Verify and reinforce LLMs step-by-step without...

[29] Wang, Y., Yue, X., and Chen, W. Critique fine-tuning: Learning to critique is more effective than learning to imitate. arXiv preprint arXiv:2501.17703, 2025b.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.

[30] Xia, H., Li, Y., Leong, C. T., Wang, W., and Li, W. TokenSkip: Controllable chain-of-thought compression in LLMs. arXiv preprint arXiv:2502.12067.

[31] Xiong, W., Dong, H., Ye, C., Wang, Z., Zhong, H., Ji, H., Jiang, N., and Zhang, T. Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. arXiv preprint arXiv:2312.11456.

[32] Xiong, W., Zhang, H., Ye, C., Chen, L., Jiang, N., and Zhang, T. Self-rewarding correction for mathematical reasoning. arXiv preprint arXiv:2502.19613, 2025a.

Xiong, W., Zhao, W., Yuan, W., Golovneva, O., Zhang, T., Weston, J., and Sukhbaatar, S. StepWiser: Stepwise generative judges for wiser reasoning. arXiv preprint arXiv:2508.19229, 2025b.

Xu, D., Qiu, ...

[33] Qwen2 Technical Report. Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Tang, J., Wang, J., Yang, J., Tu, J., Zhang, J., Ma, J., Xu, J., Zhou, J., Bai, J., He, J., Lin, J., Dang, K., Lu, K., Chen, K., Yang, K., Li, M., Xue, M., Ni, N., Zhang, P., Wang, P., Peng, R., Men, R., Gao, R., Lin, R., Wang, S., Bai...

[34] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[35] Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P. LIMO: Less is more for reasoning. arXiv preprint arXiv:2502.03387.

[36] Yeo, E., Tong, Y., Niu, M., Neubig, G., and Yue, X. Demystifying long chain-of-thought reasoning in LLMs. arXiv preprint arXiv:2502.03373.

[37] Yoon, E., Yoon, H. S., Eom, S., Han, G., Nam, D. W., Jo, D., On, K.-W., Hasegawa-Johnson, M. A., Kim, S., and Yoo, C. D. TLCR: Token-level continuous reward for fine-grained reinforcement learning from human feedback. arXiv preprint arXiv:2407.16574.

[38] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Fan, T., Liu, G., Liu, L., Liu, X., et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

[39] Yuan, L., Li, W., Chen, H., Cui, G., Ding, N., Zhang, K., Zhou, B., Liu, Z., and Peng, H. Free process rewards without process labels. arXiv preprint arXiv:2412.01981.

[40] Yuan, Y., Yu, Q., Zuo, X., Zhu, R., Xu, W., Chen, J., Wang, C., Fan, T., Du, Z., Wei, X., et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118.

[41] Zeng, S., Liu, Y., Rangwala, H., Karypis, G., Hong, M., and Fakoor, R. From demonstrations to rewards: Alignment without explicit human preferences. arXiv preprint arXiv:2503.13538.

[42] Zeng, Y., Liu, G., Ma, W., Yang, N., Zhang, H., and Wang, J. Token-level direct preference optimization. arXiv preprint arXiv:2404.11999.

[43] Zha, K., Gao, Z., Shen, M., Hong, Z.-W., Boning, D. S., and Katabi, D. RL Tango: Reinforcing generator and verifier together for language reasoning. arXiv preprint arXiv:2505.15034.

[44] Zhang, D., Zhoubian, S., Hu, Z., Yue, Y., Dong, Y., and Tang, J. ReST-MCTS*: LLM self-training via process reward guided tree search. Advances in Neural Information Processing Systems, 37:64735–64772, 2024a.

Zhang, H., Wang, P., Diao, S., Lin, Y., Pan, R., Dong, H., Zhang, D., Molchanov, P., and Zhang, T. Entropy-regularized process reward model. arXi...

framework to reduce memory usage and accelerate computation. For the test-time scaling experiments in Section C, we set the temperature to 0.8 while keeping all other hyper-parameters unchanged.

C. Ablation study

Unless otherwise specified, all ablation study experiments are conducted using the Qwen2.5-3B-Instruct model. ...
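To make the sampling temperature of 0.8 mentioned above concrete, here is a minimal sketch of how temperature rescales next-token logits before sampling. It is a generic illustration, not the paper's decoding code; the function name `temperature_probs` and the example logits are assumptions introduced here.

```python
import math


def temperature_probs(logits, temperature):
    """Softmax over logits divided by the temperature (illustrative helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


example_logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens
probs_at_1 = temperature_probs(example_logits, 1.0)
probs_at_08 = temperature_probs(example_logits, 0.8)
```

A temperature below 1.0 sharpens the distribution, so the most likely token receives more probability mass than at temperature 1.0 while the other tokens can still be sampled.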
