pith. machine review for the scientific record.

arxiv: 2604.10701 · v1 · submitted 2026-04-12 · 💻 cs.LG · cs.AI · cs.CL


Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning


Pith reviewed 2026-05-10 15:04 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords generative critics · value modeling · LLM reinforcement learning · chain-of-thought reasoning · actor-critic methods · credit assignment · in-context conditioning

The pith

Generative chain-of-thought critics replace one-shot value prediction and improve credit assignment in LLM reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that conventional value models fail to scale reliably in LLM RL because one-shot scalar prediction lacks sufficient expressiveness for complex value functions. It introduces a generative critic that first produces chain-of-thought reasoning and then outputs a value estimate, kept calibrated to the evolving actor via in-context conditioning. These changes yield better value approximation, more reliable action ranking, and stronger out-of-distribution generalization than both traditional value-based and value-free methods. The improvements carry through to higher downstream policy performance on RL tasks.

Core claim

A generative actor-critic framework replaces one-shot scalar value prediction with chain-of-thought reasoning followed by a value estimate, augmented by in-context conditioning that keeps the critic aligned with the current actor throughout training; this produces more accurate value functions, better ranking, and improved generalization that translate into stronger RL performance than value-based or value-free baselines.

What carries the argument

Generative Actor-Critic (GenAC) with In-Context Conditioning, which performs step-by-step reasoning before value estimation and conditions the critic on the actor to maintain calibration.
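
To make that machinery concrete, the sketch below shows the generic shape of a generative critic: reason in text first, then emit a parseable scalar. The prompt wording, the VALUE tag, and the critic_llm interface are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a generative critic: chain-of-thought first, scalar value second.
# critic_llm is any text-in/text-out model handle; the prompt and VALUE tag
# are assumed conventions, not the paper's exact template.
import re

VALUE_TAG = re.compile(r"VALUE:\s*(-?\d+(?:\.\d+)?)")

def generative_value(critic_llm, state_text: str) -> float:
    """Estimate V(s) by generating reasoning, then parsing a tagged scalar."""
    prompt = (
        "You are a value critic. Analyze the partial solution step by step, "
        "then end with a line 'VALUE: <number in [0, 1]>'.\n\n"
        f"Partial solution:\n{state_text}\n"
    )
    completion = critic_llm.generate(prompt)  # assumed generation API
    match = VALUE_TAG.search(completion)
    if match is None:
        return 0.5  # neutral fallback when the tag fails to parse
    return max(0.0, min(1.0, float(match.group(1))))  # clamp to valid range
```

A one-shot discriminative critic would instead regress the scalar directly from hidden states; the generative variant spends inference-time tokens to buy expressiveness.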

If this is right

  • More accurate advantage estimates become available for credit assignment in long-horizon LLM tasks (a generic advantage computation is sketched after this list).
  • Action ranking improves, allowing policies to select higher-value continuations more consistently.
  • Out-of-distribution states receive more reliable value signals, supporting robust generalization.
  • RL training can retain the benefits of value modeling without reverting to purely value-free methods.
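
Where those advantage estimates come from is standard actor-critic machinery; a minimal generalized advantage estimation (GAE) sketch, following reference [11] below, shows how per-token critic values enter. This is the generic computation, not code from the paper.

```python
# Generalized advantage estimation over per-token values (Schulman et al.,
# 2015): better values -> better TD errors -> sharper per-token credit.
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """rewards[t] per token; values needs one extra bootstrap entry at the end."""
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse terminal reward with per-token critic values, e.g.:
# gae_advantages([0.0, 0.0, 1.0], [0.2, 0.5, 0.8, 0.0])
```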

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same generative-critic pattern could be tested in non-LLM RL domains where one-shot value heads underperform.
  • If chain-of-thought value reasoning proves stable, future LLM RL pipelines may default to generative rather than discriminative critics.
  • In-context conditioning might reduce the frequency of critic retraining or reset steps during long training runs.

Load-bearing premise

The main difficulty with value models comes from limited expressiveness in one-shot prediction rather than other training instabilities, and generative critics can be trained reliably while staying calibrated via in-context conditioning.

What would settle it

An experiment in which scaled-up one-shot value models achieve comparable gains in approximation accuracy, ranking reliability, and downstream RL performance would show that generative critics are not required.
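
In sketch form, that experiment is a scaling sweep of one-shot value heads on a fixed value-fitting task, mirroring the setup described under Figure 2. train_value_head and eval_mse are hypothetical helpers; only the experimental shape is intended.

```python
# Hypothetical harness for the settling experiment: if validation MSE (and,
# analogously, ranking agreement) improves cleanly with scale, one-shot
# critics suffice; a flat or noisy curve supports the expressiveness
# argument. train_value_head / eval_mse are placeholder names.
def scaling_sweep(model_sizes, dataset, seeds=range(10)):
    results = {}
    for size in model_sizes:
        mses = [eval_mse(train_value_head(size, dataset, seed=s), dataset)
                for s in seeds]
        results[size] = (sum(mses) / len(mses),
                         max(mses) - min(mses))  # mean and seed spread
    return results
```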

Figures

Figures reproduced from arXiv: 2604.10701 by Han Zhong, Liwei Wang, Li Zhao, Zikang Shan.

Figure 1. Comparison of advantage estimation under value-free methods, PPO, and GenAC, on a … [caption truncated; image not reproduced]
Figure 2. (a) Comparison of approximation performance between discriminative critics and generative critics. Both critics are trained to fit the same fixed value function until convergence. We repeat each setup 10 times with different random seeds, and visualize the mean and standard deviation of validation mean squared error. Notably, generative critics are more robust to randomness, scale better with model size, a… [caption truncated; image not reproduced]
Figure 3. Our proposed prompt template. ICC hints are colored in red. [image not reproduced]
Figure 4. A case where the generative critic accurately detects a conceptual error. [image not reproduced]
Original abstract

Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern large language model (LLM) RL because conventional discriminative critics are difficult to train reliably. We revisit value modeling and argue that this difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and our scaling experiments show that such critics do not improve reliably with scale. Motivated by this observation, we propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. We further introduce In-Context Conditioning, which helps the critic remain calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines. Overall, our results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that conventional discriminative value models in LLM RL are difficult to train reliably due to limited expressiveness under one-shot scalar prediction, as suggested by representation complexity theory and supported by scaling experiments showing unreliable gains with scale. It proposes Generative Actor-Critic (GenAC), which uses a generative critic performing chain-of-thought reasoning before value estimation, combined with in-context conditioning to maintain calibration to the evolving actor policy. The work reports that GenAC yields better value approximation, ranking reliability, and out-of-distribution generalization, translating to stronger downstream RL performance than both value-based and value-free baselines.

Significance. If the empirical results hold with proper controls, this could meaningfully revive value-based methods for credit assignment in LLM RL, offering a path to improved sample efficiency and performance by leveraging generative reasoning for more expressive value functions. The scaling experiments and introduction of in-context conditioning provide concrete empirical grounding for the representation-complexity motivation, distinguishing this from purely theoretical proposals.

major comments (2)
  1. [§3.2] In-Context Conditioning: The mechanism relies on prompt-level context without updating critic parameters to the new policy distribution. This leaves open whether calibration holds under actor-induced distribution shift during RL training, which is load-bearing for the central claim that improved value modeling produces better advantage estimates and downstream returns (as noted in the abstract). A guess at what such prompt-level conditioning could look like is sketched after these comments.
  2. [Abstract / §5] Scaling experiments and downstream results: The reported scaling experiments and RL performance gains are described at a high level without details on architectures, training procedures, statistical tests, ablation controls, or exact metrics for value approximation and ranking. This makes it difficult to evaluate the reliability of the claim that standard critics do not improve with scale or that GenAC gains are robust.
minor comments (1)
  1. [§2 / §3] The abstract and introduction introduce 'Generative Critic' and 'In-Context Conditioning' as novel terms; a brief formal definition or pseudocode in §2 or §3 would improve clarity for readers unfamiliar with the representation complexity framing.
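
On the first major comment, for concreteness: here is a guess at what prompt-level conditioning could look like — recent rollouts from the current actor, with observed returns, prepended as hints so the critic's reasoning stays anchored to the policy it evaluates. The hint format is an assumption; the paper's actual template appears in Figure 3.

```python
# Guessed shape of In-Context Conditioning (ICC): condition the critic on
# fresh samples from the current actor instead of updating its parameters.
# The hint wording is invented; see Figure 3 for the paper's real template.
def build_icc_prompt(state_text, recent_rollouts, max_hints=3):
    """recent_rollouts: list of (completion_text, observed_return) pairs."""
    hints = "\n".join(
        f"[hint] actor sample: {text[:200]} | observed return: {ret:.2f}"
        for text, ret in recent_rollouts[:max_hints]
    )
    return (
        "Recent behavior of the policy you are evaluating:\n"
        f"{hints}\n\n"
        "Reason step by step, then end with 'VALUE: <number>'.\n\n"
        f"Partial solution:\n{state_text}\n"
    )
```

If the referee's worry holds, the failure mode would show up as hints growing stale faster than they are refreshed.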

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. The comments highlight important aspects of our methodology and presentation that we address below. We have revised the manuscript to incorporate additional details and analysis where needed.

Point-by-point responses
  1. Referee: [§3.2] In-Context Conditioning: The mechanism relies on prompt-level context without updating critic parameters to the new policy distribution. This leaves open whether calibration holds under actor-induced distribution shift during RL training, which is load-bearing for the central claim that improved value modeling produces better advantage estimates and downstream returns (as noted in the abstract).

    Authors: We appreciate the referee's emphasis on distribution shift, which is indeed central to our claims. Although in-context conditioning uses prompt-level context without parameter updates to the critic, our experiments in §5 demonstrate that this approach maintains effective calibration throughout RL training. Specifically, GenAC exhibits sustained improvements in value approximation accuracy and advantage ranking reliability as the actor policy evolves, outperforming both standard value-based and value-free baselines in downstream returns. These results provide empirical support that the conditioning mechanism adapts to the shifting distribution. In the revised manuscript, we have added a new paragraph in §3.2 and supplementary plots in §5 showing the correlation between critic estimates and Monte Carlo returns over training steps to further substantiate calibration under shift. revision: partial
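
The calibration diagnostic proposed here is simple to state: at each training step, correlate critic estimates with Monte Carlo returns and watch whether the curve decays as the actor drifts. A minimal sketch over paired per-step arrays, standard library only:

```python
# Calibration-under-shift curve: Pearson correlation between critic value
# estimates and Monte Carlo returns, one point per RL training step.
from statistics import correlation  # Python 3.10+

def calibration_curve(history):
    """history: list of (critic_estimates, mc_returns) pairs, one per step."""
    return [correlation(est, ret) for est, ret in history]

# A curve that stays high as the actor evolves supports the ICC claim;
# a decaying curve indicates calibration loss under distribution shift.
```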

  2. Referee: [Abstract / §5] Scaling experiments and downstream results: The reported scaling experiments and RL performance gains are described at a high level without details on architectures, training procedures, statistical tests, ablation controls, or exact metrics for value approximation and ranking. This makes it difficult to evaluate the reliability of the claim that standard critics do not improve with scale or that GenAC gains are robust.

    Authors: We agree that the original presentation of the scaling experiments and RL results in the abstract and §5 was insufficiently detailed for rigorous evaluation. The revised manuscript now includes expanded descriptions in §5: full model architectures and hyperparameters for both actor and critic, training procedures for the RL loop, statistical tests (including paired t-tests with reported p-values and confidence intervals), ablation studies isolating the contributions of chain-of-thought reasoning and in-context conditioning, and precise metrics (MSE for value approximation, Kendall tau for ranking reliability, and normalized returns for downstream performance). These additions directly address the concerns and allow readers to assess the robustness of the finding that standard critics scale unreliably while GenAC gains persist. revision: yes
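
Of the listed metrics, Kendall's tau is the one worth spelling out: it measures agreement between the critic's ordering of candidates and the ordering by ground-truth value. A small self-contained version (tau-a, with ties counted as disagreements) is sketched below; this is the generic statistic, not the paper's evaluation code.

```python
# Kendall's tau over candidate actions: +1 means the critic ranks candidates
# exactly as the ground-truth values do, -1 means fully reversed.
from itertools import combinations

def kendall_tau(critic_scores, true_values):
    concordant = discordant = 0
    for i, j in combinations(range(len(critic_scores)), 2):
        sign = ((critic_scores[i] - critic_scores[j])
                * (true_values[i] - true_values[j]))
        if sign > 0:
            concordant += 1
        else:
            discordant += 1  # ties counted as disagreement in this sketch
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```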

Circularity Check

0 steps flagged

No circularity: empirical proposal with independent experimental validation

Full rationale

The paper frames GenAC as an empirical method motivated by cited representation complexity theory (one-shot scalar prediction hardness) and validated through scaling experiments plus downstream RL benchmarks. No load-bearing step reduces by construction to fitted inputs, self-citations, or renamed known results; value estimates, ranking metrics, and RL returns are measured directly against baselines rather than derived tautologically. The in-context conditioning mechanism is introduced as a practical alignment technique without any uniqueness theorem or ansatz smuggled via prior self-work. This is a standard self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that value functions suffer from expressiveness limits under one-shot prediction and that generative reasoning can overcome this; no explicit free parameters or invented physical entities are introduced in the abstract.

axioms (1)
  • domain assumption Value functions are difficult to approximate under the one-shot prediction paradigm used by existing value models, per representation complexity theory.
    Invoked to explain why conventional critics fail to scale and to motivate the generative approach.
invented entities (2)
  • Generative Critic no independent evidence
    purpose: Performs chain-of-thought reasoning before producing value estimates instead of direct scalar prediction.
    Core new component proposed to increase expressiveness.
  • In-Context Conditioning no independent evidence
    purpose: Keeps the critic calibrated to the current actor policy throughout training.
    Introduced to address potential misalignment between critic and actor.

pith-pipeline@v0.9.0 · 5517 in / 1402 out tokens · 31581 ms · 2026-05-10T15:04:47.646335+00:00 · methodology


Reference graph

Works this paper leans on

55 extracted references · 22 canonical work pages · 12 internal anchors

  1. [1]

    OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  3. [3]

    Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

  4. [4]

Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  5. [5]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  6. [6]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  7. [7]

    Coderl: Mastering code generation through pretrained models and deep reinforcement learning

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022

  8. [8]

Reinforcement Learning: An Introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998

  9. [9]

Actor-Critic Algorithms

    Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999

  10. [10]

    Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  11. [11]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

  12. [12]

    Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024

  13. [13]

The Epoch-Greedy Algorithm for Multi-Armed Bandits with Side Information

    John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. Advances in neural information processing systems, 20, 2007

  14. [14]

What’s Behind PPO’s Collapse in Long-CoT? Value Optimization Holds the Secret

    Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What’s behind ppo’s collapse in long-cot? value optimization holds the secret. arXiv preprint arXiv:2503.01491, 2025

  15. [15]

VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision

    Dingwei Zhu, Shihan Dou, Zhiheng Xi, Senjie Jin, Guoqiang Zhang, Jiazheng Zhang, Junjie Ye, Mingxu Chai, Enyu Zhou, Ming Zhang, et al. Vrpo: Rethinking value modeling for robust rl training under noisy supervision. arXiv preprint arXiv:2508.03058, 2025

  16. [16]

Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training

    Xue Gong, Qi Yi, Ziyuan Nan, Guanhua Huang, Kejiao Li, Yuhao Jiang, Ruibin Xiong, Zenan Xu, Jiaming Guo, Shaohui Peng, et al. Segmental advantage estimation: Enhancing ppo for long-context llm training. arXiv preprint arXiv:2601.07320, 2026

  17. [17]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025

  18. [18]

    Truncated proximal policy optimization

Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu, Yufeng Yuan, et al. Truncated proximal policy optimization. arXiv preprint arXiv:2506.15050, 2025

  19. [19]

Asymmetric Proximal Policy Optimization: Mini-Critics Boost LLM Reasoning

    Jiashun Liu, Johan Obando-Ceron, Han Lu, Yancheng He, Weixun Wang, Wenbo Su, Bo Zheng, Pablo Samuel Castro, Aaron Courville, and Ling Pan. Asymmetric proximal policy optimization: mini-critics boost llm reasoning. arXiv preprint arXiv:2510.01656, 2025

  20. [20]

    Vineppo: Refining credit assignment in rl training of llms

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Refining credit assignment in rl training of llms. In International Conference on Machine Learning, pages 29557–29590. PMLR, 2025

  21. [21]

    Segment policy optimization: Effective segment-level credit assignment in rl for large language models

Yiran Guo, Lijie Xu, Jie Liu, Ye Dan, and Shuang Qiu. Segment policy optimization: Effective segment-level credit assignment in rl for large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  22. [22]

    Group-in-group policy optimization for llm agent training

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  23. [23]

    Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023

  24. [24]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024

  25. [25]

    Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, et al. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592, 2024

  26. [26]

    Dpo meets ppo: Reinforced token optimization for rlhf

Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, and Liwei Wang. Dpo meets ppo: Reinforced token optimization for rlhf. In International Conference on Machine Learning, pages 78498–78521. PMLR, 2025

  27. [27]

BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning

    Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, et al. Brite: Bootstrapping reinforced thinking process to enhance language model reasoning. arXiv preprint arXiv:2501.18858, 2025

  28. [28]

    Free process rewards without process labels

Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. In International Conference on Machine Learning, pages 73511–73525. PMLR, 2025

  29. [29]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025

  30. [30]

    Optimizing test-time compute via meta reinforcement finetuning

Yuxiao Qu, Matthew YR Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement finetuning. In International Conference on Machine Learning, pages 50893–50925. PMLR, 2025

  31. [31]

    Generative verifiers: Reward modeling as next-token prediction

Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. In The Thirteenth International Conference on Learning Representations, 2024

  32. [32]

Generative Reward Models

    Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, and Alon Albalak. Generative reward models. arXiv preprint arXiv:2410.12832, 2024

  33. [33]

Inference-Time Scaling for Generalist Reward Modeling

    Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025

  34. [34]

RM-R1: Reward Modeling as Reasoning

    Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. Rm-r1: Reward modeling as reasoning. arXiv preprint arXiv:2505.02387, 2025

  35. [35]

    Reward reasoning models

Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. Reward reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  36. [36]

    Genprm: Scaling test-time compute of process reward models via generative reasoning

Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, et al. Genprm: Scaling test-time compute of process reward models via generative reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34932–34940, 2026

  37. [37]

Process Reward Models That Think

    Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think. arXiv preprint arXiv:2504.16828, 2025

  38. [38]

    Rl tango: Reinforcing generator and verifier together for language reasoning

Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S Boning, and Dina Katabi. Rl tango: Reinforcing generator and verifier together for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  39. [39]

Rethinking Model-Based, Policy-Based, and Value-Based Reinforcement Learning via the Lens of Representation Complexity

    Guhao Feng and Han Zhong. Rethinking model-based, policy-based, and value-based reinforcement learning via the lens of representation complexity. Advances in Neural Information Processing Systems, 37:8569–8611, 2024

  40. [40]

The Parallelism Tradeoff: Limitations of Log-Precision Transformers

    William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023

  41. [41]

    Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  42. [42]

    Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca30...

  43. [43]

Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective

    Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36:70757–70798, 2023

  44. [44]

    Chain of thought empowers transformers to solve inherently serial problems

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. In The Twelfth International Conference on Learning Representations, 2024

  45. [45]

    OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

  46. [46]

    Preference fine-tuning of llms should leverage suboptimal, on-policy data

Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of llms should leverage suboptimal, on-policy data. In International Conference on Machine Learning, pages 47441–47474. PMLR, 2024

  47. [47]

    Sft memorizes, rl generalizes: A comparative study of foundation model post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. In International Conference on Machine Learning, pages 10818–10838. PMLR, 2025

  48. [48]

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992

  49. [49]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

  50. [50]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

  51. [51]

Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in neural information processing systems, 35:3843–3857, 2022

  52. [52]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

  53. [53]

    HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  54. [54]

Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  55. [55]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. PyTorch FSDP: Experiences on scaling fully sharded data parallel. Proceedings of the VLDB Endowment, 16(12):3848–3860, 2023