pith. machine review for the scientific record.

arxiv: 2604.10701 · v1 · submitted 2026-04-12 · 💻 cs.LG · cs.AI · cs.CL


Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning


Pith reviewed 2026-05-10 15:04 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords generative critics · value modeling · LLM reinforcement learning · chain-of-thought reasoning · actor-critic methods · credit assignment · in-context conditioning

The pith

Generative chain-of-thought critics replace one-shot value prediction and improve credit assignment in LLM reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that conventional value models fail to scale reliably in LLM RL because one-shot scalar prediction lacks sufficient expressiveness for complex value functions. It introduces a generative critic that first produces chain-of-thought reasoning and then outputs a value estimate, kept calibrated to the evolving actor via in-context conditioning. These changes yield better value approximation, more reliable action ranking, and stronger out-of-distribution generalization than both traditional value-based and value-free methods. The improvements carry through to higher downstream policy performance on RL tasks.

Core claim

A generative actor-critic framework replaces one-shot scalar value prediction with chain-of-thought reasoning followed by a value estimate, augmented by in-context conditioning that keeps the critic aligned with the current actor throughout training; this produces more accurate value functions, better ranking, and improved generalization that translate into stronger RL performance than value-based or value-free baselines.

What carries the argument

Generative Actor-Critic (GenAC) with In-Context Conditioning, which performs step-by-step reasoning before value estimation and conditions the critic on the actor to maintain calibration.
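
To make that machinery concrete, the sketch below shows the generic shape of a generative critic: reason in text first, then emit a parseable scalar. The prompt wording, the VALUE tag, and the critic_llm interface are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a generative critic: chain-of-thought first, scalar value second.
# critic_llm is any text-in/text-out model handle; the prompt and VALUE tag
# are assumed conventions, not the paper's exact template.
import re

VALUE_TAG = re.compile(r"VALUE:\s*(-?\d+(?:\.\d+)?)")

def generative_value(critic_llm, state_text: str) -> float:
    """Estimate V(s) by generating reasoning, then parsing a tagged scalar."""
    prompt = (
        "You are a value critic. Analyze the partial solution step by step, "
        "then end with a line 'VALUE: <number in [0, 1]>'.\n\n"
        f"Partial solution:\n{state_text}\n"
    )
    completion = critic_llm.generate(prompt)  # assumed generation API
    match = VALUE_TAG.search(completion)
    if match is None:
        return 0.5  # neutral fallback when the tag fails to parse
    return max(0.0, min(1.0, float(match.group(1))))  # clamp to valid range
```

A one-shot discriminative critic would instead regress the scalar directly from hidden states; the generative variant spends inference-time tokens to buy expressiveness.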

If this is right

  • More accurate advantage estimates become available for credit assignment in long-horizon LLM tasks (a generic advantage computation is sketched after this list).
  • Action ranking improves, allowing policies to select higher-value continuations more consistently.
  • Out-of-distribution states receive more reliable value signals, supporting robust generalization.
  • RL training can retain the benefits of value modeling without reverting to purely value-free methods.
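
Where those advantage estimates come from is standard actor-critic machinery; a minimal generalized advantage estimation (GAE) sketch, following reference [11] below, shows how per-token critic values enter. This is the generic computation, not code from the paper.

```python
# Generalized advantage estimation over per-token values (Schulman et al.,
# 2015): better values -> better TD errors -> sharper per-token credit.
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """rewards[t] per token; values needs one extra bootstrap entry at the end."""
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse terminal reward with per-token critic values, e.g.:
# gae_advantages([0.0, 0.0, 1.0], [0.2, 0.5, 0.8, 0.0])
```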

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same generative-critic pattern could be tested in non-LLM RL domains where one-shot value heads underperform.
  • If chain-of-thought value reasoning proves stable, future LLM RL pipelines may default to generative rather than discriminative critics.
  • In-context conditioning might reduce the frequency of critic retraining or reset steps during long training runs.

Load-bearing premise

The main difficulty with value models comes from limited expressiveness in one-shot prediction rather than other training instabilities, and generative critics can be trained reliably while staying calibrated via in-context conditioning.

What would settle it

An experiment in which scaled-up one-shot value models achieve comparable gains in approximation accuracy, ranking reliability, and downstream RL performance would show that generative critics are not required.
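
In sketch form, that experiment is a scaling sweep of one-shot value heads on a fixed value-fitting task, mirroring the setup described under Figure 2. train_value_head and eval_mse are hypothetical helpers; only the experimental shape is intended.

```python
# Hypothetical harness for the settling experiment: if validation MSE (and,
# analogously, ranking agreement) improves cleanly with scale, one-shot
# critics suffice; a flat or noisy curve supports the expressiveness
# argument. train_value_head / eval_mse are placeholder names.
def scaling_sweep(model_sizes, dataset, seeds=range(10)):
    results = {}
    for size in model_sizes:
        mses = [eval_mse(train_value_head(size, dataset, seed=s), dataset)
                for s in seeds]
        results[size] = (sum(mses) / len(mses),
                         max(mses) - min(mses))  # mean and seed spread
    return results
```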

Figures

Figures reproduced from arXiv: 2604.10701 by Han Zhong, Liwei Wang, Li Zhao, Zikang Shan.

Figure 1. Comparison of advantage estimation under value-free methods, PPO, and GenAC, on a … [caption truncated; image not reproduced]
Figure 2. (a) Comparison of approximation performance between discriminative critics and generative critics. Both critics are trained to fit the same fixed value function until convergence. We repeat each setup 10 times with different random seeds, and visualize the mean and standard deviation of validation mean squared error. Notably, generative critics are more robust to randomness, scale better with model size, a… [caption truncated; image not reproduced]
Figure 3. Our proposed prompt template. ICC hints are colored in red. [image not reproduced]
Figure 4. A case where the generative critic accurately detects a conceptual error. [image not reproduced]
Original abstract

Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern large language model (LLM) RL because conventional discriminative critics are difficult to train reliably. We revisit value modeling and argue that this difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and our scaling experiments show that such critics do not improve reliably with scale. Motivated by this observation, we propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. We further introduce In-Context Conditioning, which helps the critic remain calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines. Overall, our results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that conventional discriminative value models in LLM RL are difficult to train reliably due to limited expressiveness under one-shot scalar prediction, as suggested by representation complexity theory and supported by scaling experiments showing unreliable gains with scale. It proposes Generative Actor-Critic (GenAC), which uses a generative critic performing chain-of-thought reasoning before value estimation, combined with in-context conditioning to maintain calibration to the evolving actor policy. The work reports that GenAC yields better value approximation, ranking reliability, and out-of-distribution generalization, translating to stronger downstream RL performance than both value-based and value-free baselines.

Significance. If the empirical results hold with proper controls, this could meaningfully revive value-based methods for credit assignment in LLM RL, offering a path to improved sample efficiency and performance by leveraging generative reasoning for more expressive value functions. The scaling experiments and introduction of in-context conditioning provide concrete empirical grounding for the representation-complexity motivation, distinguishing this from purely theoretical proposals.

major comments (2)
  1. [§3.2] In-Context Conditioning: The mechanism relies on prompt-level context without updating critic parameters to the new policy distribution. This leaves open whether calibration holds under actor-induced distribution shift during RL training, which is load-bearing for the central claim that improved value modeling produces better advantage estimates and downstream returns (as noted in the abstract). A guess at what such prompt-level conditioning could look like is sketched after these comments.
  2. [Abstract / §5] Scaling experiments and downstream results: The reported scaling experiments and RL performance gains are described at a high level without details on architectures, training procedures, statistical tests, ablation controls, or exact metrics for value approximation and ranking. This makes it difficult to evaluate the reliability of the claim that standard critics do not improve with scale or that GenAC gains are robust.
minor comments (1)
  1. [§2 / §3] The abstract and introduction introduce 'Generative Critic' and 'In-Context Conditioning' as novel terms; a brief formal definition or pseudocode in §2 or §3 would improve clarity for readers unfamiliar with the representation complexity framing.
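
On the first major comment, for concreteness: here is a guess at what prompt-level conditioning could look like — recent rollouts from the current actor, with observed returns, prepended as hints so the critic's reasoning stays anchored to the policy it evaluates. The hint format is an assumption; the paper's actual template appears in Figure 3.

```python
# Guessed shape of In-Context Conditioning (ICC): condition the critic on
# fresh samples from the current actor instead of updating its parameters.
# The hint wording is invented; see Figure 3 for the paper's real template.
def build_icc_prompt(state_text, recent_rollouts, max_hints=3):
    """recent_rollouts: list of (completion_text, observed_return) pairs."""
    hints = "\n".join(
        f"[hint] actor sample: {text[:200]} | observed return: {ret:.2f}"
        for text, ret in recent_rollouts[:max_hints]
    )
    return (
        "Recent behavior of the policy you are evaluating:\n"
        f"{hints}\n\n"
        "Reason step by step, then end with 'VALUE: <number>'.\n\n"
        f"Partial solution:\n{state_text}\n"
    )
```

If the referee's worry holds, the failure mode would show up as hints growing stale faster than they are refreshed.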

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. The comments highlight important aspects of our methodology and presentation that we address below. We have revised the manuscript to incorporate additional details and analysis where needed.

Point-by-point responses
  1. Referee: [§3.2] In-Context Conditioning: The mechanism relies on prompt-level context without updating critic parameters to the new policy distribution. This leaves open whether calibration holds under actor-induced distribution shift during RL training, which is load-bearing for the central claim that improved value modeling produces better advantage estimates and downstream returns (as noted in the abstract).

    Authors: We appreciate the referee's emphasis on distribution shift, which is indeed central to our claims. Although in-context conditioning uses prompt-level context without parameter updates to the critic, our experiments in §5 demonstrate that this approach maintains effective calibration throughout RL training. Specifically, GenAC exhibits sustained improvements in value approximation accuracy and advantage ranking reliability as the actor policy evolves, outperforming both standard value-based and value-free baselines in downstream returns. These results provide empirical support that the conditioning mechanism adapts to the shifting distribution. In the revised manuscript, we have added a new paragraph in §3.2 and supplementary plots in §5 showing the correlation between critic estimates and Monte Carlo returns over training steps to further substantiate calibration under shift. revision: partial
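
The calibration diagnostic proposed here is simple to state: at each training step, correlate critic estimates with Monte Carlo returns and watch whether the curve decays as the actor drifts. A minimal sketch over paired per-step arrays, standard library only:

```python
# Calibration-under-shift curve: Pearson correlation between critic value
# estimates and Monte Carlo returns, one point per RL training step.
from statistics import correlation  # Python 3.10+

def calibration_curve(history):
    """history: list of (critic_estimates, mc_returns) pairs, one per step."""
    return [correlation(est, ret) for est, ret in history]

# A curve that stays high as the actor evolves supports the ICC claim;
# a decaying curve indicates calibration loss under distribution shift.
```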

  2. Referee: [Abstract / §5] Scaling experiments and downstream results: The reported scaling experiments and RL performance gains are described at a high level without details on architectures, training procedures, statistical tests, ablation controls, or exact metrics for value approximation and ranking. This makes it difficult to evaluate the reliability of the claim that standard critics do not improve with scale or that GenAC gains are robust.

    Authors: We agree that the original presentation of the scaling experiments and RL results in the abstract and §5 was insufficiently detailed for rigorous evaluation. The revised manuscript now includes expanded descriptions in §5: full model architectures and hyperparameters for both actor and critic, training procedures for the RL loop, statistical tests (including paired t-tests with reported p-values and confidence intervals), ablation studies isolating the contributions of chain-of-thought reasoning and in-context conditioning, and precise metrics (MSE for value approximation, Kendall tau for ranking reliability, and normalized returns for downstream performance). These additions directly address the concerns and allow readers to assess the robustness of the finding that standard critics scale unreliably while GenAC gains persist. revision: yes
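
Of the listed metrics, Kendall's tau is the one worth spelling out: it measures agreement between the critic's ordering of candidates and the ordering by ground-truth value. A small self-contained version (tau-a, with ties counted as disagreements) is sketched below; this is the generic statistic, not the paper's evaluation code.

```python
# Kendall's tau over candidate actions: +1 means the critic ranks candidates
# exactly as the ground-truth values do, -1 means fully reversed.
from itertools import combinations

def kendall_tau(critic_scores, true_values):
    concordant = discordant = 0
    for i, j in combinations(range(len(critic_scores)), 2):
        sign = ((critic_scores[i] - critic_scores[j])
                * (true_values[i] - true_values[j]))
        if sign > 0:
            concordant += 1
        else:
            discordant += 1  # ties counted as disagreement in this sketch
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```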

Circularity Check

0 steps flagged

No circularity: empirical proposal with independent experimental validation

Full rationale

The paper frames GenAC as an empirical method motivated by cited representation complexity theory (one-shot scalar prediction hardness) and validated through scaling experiments plus downstream RL benchmarks. No load-bearing step reduces by construction to fitted inputs, self-citations, or renamed known results; value estimates, ranking metrics, and RL returns are measured directly against baselines rather than derived tautologically. The in-context conditioning mechanism is introduced as a practical alignment technique without any uniqueness theorem or ansatz smuggled via prior self-work. This is a standard self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that value functions suffer from expressiveness limits under one-shot prediction and that generative reasoning can overcome this; no explicit free parameters or invented physical entities are introduced in the abstract.

axioms (1)
  • domain assumption Value functions are difficult to approximate under the one-shot prediction paradigm used by existing value models, per representation complexity theory.
    Invoked to explain why conventional critics fail to scale and to motivate the generative approach.
invented entities (2)
  • Generative Critic no independent evidence
    purpose: Performs chain-of-thought reasoning before producing value estimates instead of direct scalar prediction.
    Core new component proposed to increase expressiveness.
  • In-Context Conditioning no independent evidence
    purpose: Keeps the critic calibrated to the current actor policy throughout training.
    Introduced to address potential misalignment between critic and actor.

pith-pipeline@v0.9.0 · 5517 in / 1402 out tokens · 31581 ms · 2026-05-10T15:04:47.646335+00:00 · methodology


Reference graph

Works this paper leans on

55 extracted references · 22 canonical work pages · 12 internal anchors

  1. [1]

    OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  3. [3]

    Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

  4. [4]

Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  5. [5]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  6. [6]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  7. [7]

    Coderl: Mastering code generation through pretrained models and deep reinforcement learning

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022

  8. [8]

Reinforcement Learning: An Introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998

  9. [9]

Actor-Critic Algorithms

    Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999

  10. [10]

    Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  11. [11]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

  12. [12]

    Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024

  13. [13]

The Epoch-Greedy Algorithm for Multi-Armed Bandits with Side Information

    John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. Advances in neural information processing systems, 20, 2007

  14. [14]

What’s Behind PPO’s Collapse in Long-CoT? Value Optimization Holds the Secret

    Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What’s behind ppo’s collapse in long-cot? value optimization holds the secret. arXiv preprint arXiv:2503.01491, 2025

  15. [15]

VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision

    Dingwei Zhu, Shihan Dou, Zhiheng Xi, Senjie Jin, Guoqiang Zhang, Jiazheng Zhang, Junjie Ye, Mingxu Chai, Enyu Zhou, Ming Zhang, et al. Vrpo: Rethinking value modeling for robust rl training under noisy supervision. arXiv preprint arXiv:2508.03058, 2025

  16. [16]

Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training

    Xue Gong, Qi Yi, Ziyuan Nan, Guanhua Huang, Kejiao Li, Yuhao Jiang, Ruibin Xiong, Zenan Xu, Jiaming Guo, Shaohui Peng, et al. Segmental advantage estimation: Enhancing ppo for long-context llm training. arXiv preprint arXiv:2601.07320, 2026

  17. [17]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025

  18. [18]

    Truncated proximal policy optimization

Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu, Yufeng Yuan, et al. Truncated proximal policy optimization. arXiv preprint arXiv:2506.15050, 2025

  19. [19]

Asymmetric Proximal Policy Optimization: Mini-Critics Boost LLM Reasoning

    Jiashun Liu, Johan Obando-Ceron, Han Lu, Yancheng He, Weixun Wang, Wenbo Su, Bo Zheng, Pablo Samuel Castro, Aaron Courville, and Ling Pan. Asymmetric proximal policy optimization: mini-critics boost llm reasoning. arXiv preprint arXiv:2510.01656, 2025

  20. [20]

    Vineppo: Refining credit assignment in rl training of llms

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Refining credit assignment in rl training of llms. In International Conference on Machine Learning, pages 29557–29590. PMLR, 2025

  21. [21]

    Segment policy optimization: Effective segment-level credit assignment in rl for large language models

Yiran Guo, Lijie Xu, Jie Liu, Ye Dan, and Shuang Qiu. Segment policy optimization: Effective segment-level credit assignment in rl for large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  22. [22]

    Group-in-group policy optimization for llm agent training

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  23. [23]

    Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023

  24. [24]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024

  25. [25]

    Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, et al. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592, 2024

  26. [26]

    Dpo meets ppo: Reinforced token optimization for rlhf

Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, and Liwei Wang. Dpo meets ppo: Reinforced token optimization for rlhf. In International Conference on Machine Learning, pages 78498–78521. PMLR, 2025

  27. [27]

BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning

    Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, et al. Brite: Bootstrapping reinforced thinking process to enhance language model reasoning. arXiv preprint arXiv:2501.18858, 2025

  28. [28]

    Free process rewards without process labels

Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. In International Conference on Machine Learning, pages 73511–73525. PMLR, 2025

  29. [29]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025

  30. [30]

    Optimizing test-time compute via meta reinforcement finetuning

Yuxiao Qu, Matthew YR Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement finetuning. In International Conference on Machine Learning, pages 50893–50925. PMLR, 2025

  31. [31]

    Generative verifiers: Reward modeling as next-token prediction

Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. In The Thirteenth International Conference on Learning Representations, 2024

  32. [32]

Generative Reward Models

    Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, and Alon Albalak. Generative reward models. arXiv preprint arXiv:2410.12832, 2024

  33. [33]

Inference-Time Scaling for Generalist Reward Modeling

    Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025

  34. [34]

RM-R1: Reward Modeling as Reasoning

    Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. Rm-r1: Reward modeling as reasoning. arXiv preprint arXiv:2505.02387, 2025

  35. [35]

    Reward reasoning models

Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. Reward reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  36. [36]

    Genprm: Scaling test-time compute of process reward models via generative reasoning

Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, et al. Genprm: Scaling test-time compute of process reward models via generative reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34932–34940, 2026

  37. [37]

Process Reward Models That Think

    Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think. arXiv preprint arXiv:2504.16828, 2025

  38. [38]

    Rl tango: Reinforcing generator and verifier together for language reasoning

Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S Boning, and Dina Katabi. Rl tango: Reinforcing generator and verifier together for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  39. [39]

Rethinking Model-Based, Policy-Based, and Value-Based Reinforcement Learning via the Lens of Representation Complexity

    Guhao Feng and Han Zhong. Rethinking model-based, policy-based, and value-based reinforcement learning via the lens of representation complexity. Advances in Neural Information Processing Systems, 37:8569–8611, 2024

  40. [40]

The Parallelism Tradeoff: Limitations of Log-Precision Transformers

    William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023

  41. [41]

    Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  42. [42]

    Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca30...

  43. [43]

Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective

    Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36:70757–70798, 2023

  44. [44]

    Chain of thought empowers transformers to solve inherently serial problems

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. In The Twelfth International Conference on Learning Representations, 2024

  45. [45]

    OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

  46. [46]

    Preference fine-tuning of llms should leverage suboptimal, on-policy data

Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of llms should leverage suboptimal, on-policy data. In International Conference on Machine Learning, pages 47441–47474. PMLR, 2024

  47. [47]

    Sft memorizes, rl generalizes: A comparative study of foundation model post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. In International Conference on Machine Learning, pages 10818–10838. PMLR, 2025

  48. [48]

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992

  49. [49]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

  50. [50]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

  51. [51]

Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in neural information processing systems, 35:3843–3857, 2022

  52. [52]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

  53. [53]

    HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  54. [54]

Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  55. [55]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. PyTorch FSDP: Experiences on scaling fully sharded data parallel. Proceedings of the VLDB Endowment, 16(12):3848–3860, 2023