Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
Pith reviewed 2026-05-10 15:04 UTC · model grok-4.3
The pith
Generative chain-of-thought critics replace one-shot value prediction and improve credit assignment in LLM reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A generative actor-critic framework replaces one-shot scalar value prediction with chain-of-thought reasoning followed by a value estimate, augmented by in-context conditioning that keeps the critic aligned with the current actor throughout training; this produces more accurate value functions, better ranking, and improved generalization that translate into stronger RL performance than value-based or value-free baselines.
What carries the argument
Generative Actor-Critic (GenAC) with In-Context Conditioning, which performs step-by-step reasoning before value estimation and conditions the critic on the actor to maintain calibration.
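The core mechanism can be sketched in a few lines: instead of a scalar value head, the critic generates reasoning text and a final scalar that is parsed out. This is an illustrative sketch, not the paper's implementation; `stub_critic` stands in for a fine-tuned critic LLM, and the `VALUE:` output convention is an assumption.

```python
import re

def stub_critic(prompt: str) -> str:
    # Stand-in for a call to a fine-tuned generative critic LLM. The
    # reasoning text and the 'VALUE:' convention are illustrative, not
    # the paper's exact output format.
    return (
        "Step 1: the partial solution isolates x correctly.\n"
        "Step 2: the remaining arithmetic is routine, so success is likely.\n"
        "VALUE: 0.82"
    )

def generative_value(state: str, critic=stub_critic) -> float:
    """Chain-of-thought value estimate: reason in text, then parse a scalar."""
    prompt = (
        "Assess the partial trajectory below. Reason step by step, "
        f"then end with 'VALUE: <v>'.\n{state}"
    )
    completion = critic(prompt)
    match = re.search(r"VALUE:\s*([-+]?\d*\.?\d+)", completion)
    if match is None:
        raise ValueError("critic did not emit a parseable value")
    return float(match.group(1))

print(generative_value("Solve x + 3 = 7. So far: x = 7 - 3."))  # → 0.82
```

The contrast with a discriminative critic is that the scalar is produced after serial token-level computation, which is the expressiveness gap the representation-complexity argument targets.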
If this is right
- More accurate advantage estimates become available for credit assignment in long-horizon LLM tasks.
- Action ranking improves, allowing policies to select higher-value continuations more consistently.
- Out-of-distribution states receive more reliable value signals, supporting robust generalization.
- RL training can retain the benefits of value modeling without reverting to purely value-free methods.
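The first point can be made concrete. Once per-step values are available, advantages are typically computed with generalized advantage estimation (GAE); the paper's exact estimator is not specified here, so this is a standard-form sketch.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    # Generalized advantage estimation over one trajectory.
    # `values` holds V(s_0..s_T), one more entry than `rewards`,
    # with values[-1] the bootstrap value of the final state.
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse terminal reward, as in verifier-scored LLM tasks. With lam=1,
# each advantage reduces to the return minus the value at that step.
print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.6, 0.8, 0.0], lam=1.0))
```

Accurate values matter here because every error in `values` propagates directly into the per-token advantages used for credit assignment.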
Where Pith is reading between the lines
- The same generative-critic pattern could be tested in non-LLM RL domains where one-shot value heads underperform.
- If chain-of-thought value reasoning proves stable, future LLM RL pipelines may default to generative rather than discriminative critics.
- In-context conditioning might reduce the frequency of critic retraining or reset steps during long training runs.
Load-bearing premise
The main difficulty with value models comes from limited expressiveness in one-shot prediction rather than other training instabilities, and generative critics can be trained reliably while staying calibrated via in-context conditioning.
What would settle it
An experiment in which scaled-up one-shot value models achieve comparable gains in approximation accuracy, ranking reliability, and downstream RL performance would show that generative critics are not required.
Original abstract
Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern large language model (LLM) RL because conventional discriminative critics are difficult to train reliably. We revisit value modeling and argue that this difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and our scaling experiments show that such critics do not improve reliably with scale. Motivated by this observation, we propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. We further introduce In-Context Conditioning, which helps the critic remain calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines. Overall, our results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that conventional discriminative value models in LLM RL are difficult to train reliably due to limited expressiveness under one-shot scalar prediction, as suggested by representation complexity theory and supported by scaling experiments showing unreliable gains with scale. It proposes Generative Actor-Critic (GenAC), which uses a generative critic performing chain-of-thought reasoning before value estimation, combined with in-context conditioning to maintain calibration to the evolving actor policy. The work reports that GenAC yields better value approximation, ranking reliability, and out-of-distribution generalization, translating to stronger downstream RL performance than both value-based and value-free baselines.
Significance. If the empirical results hold with proper controls, this could meaningfully revive value-based methods for credit assignment in LLM RL, offering a path to improved sample efficiency and performance by leveraging generative reasoning for more expressive value functions. The scaling experiments and introduction of in-context conditioning provide concrete empirical grounding for the representation-complexity motivation, distinguishing this from purely theoretical proposals.
major comments (2)
- [§3.2] In-Context Conditioning: The mechanism relies on prompt-level context without updating critic parameters to match the new policy distribution. This leaves open whether calibration holds under actor-induced distribution shift during RL training, which is load-bearing for the central claim that improved value modeling produces better advantage estimates and downstream returns (as noted in the abstract).
- [Abstract / §5] Scaling Experiments and Downstream Results: The reported scaling experiments and RL performance gains are described at a high level, without details on architectures, training procedures, statistical tests, ablation controls, or the exact metrics used for value approximation and ranking. This makes it difficult to evaluate the reliability of the claim that standard critics do not improve with scale, or that the GenAC gains are robust.
minor comments (1)
- [§2 / §3] The abstract and introduction introduce 'Generative Critic' and 'In-Context Conditioning' as novel terms; a brief formal definition or pseudocode in §2 or §3 would improve clarity for readers unfamiliar with the representation complexity framing.
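The prompt-level mechanism questioned in the first major comment can be sketched as follows. Everything here is an assumption about how such conditioning might look (field names, prompt layout, use of recent rollouts with observed returns); the paper's specification is not reproduced in this review.

```python
def conditioned_critic_prompt(task, partial, recent_rollouts, k=2):
    # Hypothetical sketch of in-context conditioning: the last k actor
    # rollouts, paired with their observed returns, are prepended so the
    # critic's value reasoning is anchored to the current policy's
    # behavior without any parameter update to the critic.
    examples = "\n".join(
        f"[example] completion: {r['text']} | observed return: {r['ret']:.2f}"
        for r in recent_rollouts[-k:]
    )
    return (
        f"{examples}\n"
        f"[task] {task}\n"
        f"[partial] {partial}\n"
        "Reason step by step, then end with 'VALUE: <v>'."
    )

rollouts = [
    {"text": "tried casework, failed", "ret": 0.0},
    {"text": "used modular arithmetic, solved", "ret": 1.0},
]
print(conditioned_critic_prompt("Find x mod 7", "x = 3 + 7k, so", rollouts))
```

The referee's concern maps directly onto this sketch: whether a few in-context examples are enough to track the actor's shifting state distribution is an empirical question, not something the construction guarantees.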
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. The comments highlight important aspects of our methodology and presentation that we address below. We have revised the manuscript to incorporate additional details and analysis where needed.
Point-by-point responses
Referee: [§3.2] In-Context Conditioning: The mechanism relies on prompt-level context without updating critic parameters to match the new policy distribution. This leaves open whether calibration holds under actor-induced distribution shift during RL training, which is load-bearing for the central claim that improved value modeling produces better advantage estimates and downstream returns (as noted in the abstract).
Authors: We appreciate the referee's emphasis on distribution shift, which is indeed central to our claims. Although in-context conditioning uses prompt-level context without parameter updates to the critic, our experiments in §5 demonstrate that this approach maintains effective calibration throughout RL training. Specifically, GenAC exhibits sustained improvements in value approximation accuracy and advantage ranking reliability as the actor policy evolves, outperforming both standard value-based and value-free baselines in downstream returns. These results provide empirical support that the conditioning mechanism adapts to the shifting distribution. In the revised manuscript, we have added a new paragraph in §3.2 and supplementary plots in §5 showing the correlation between critic estimates and Monte Carlo returns over training steps to further substantiate calibration under shift. revision: partial
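The diagnostic the authors describe, correlating critic estimates with Monte Carlo returns on fresh rollouts over training, is straightforward to compute; this is an illustrative sketch with made-up numbers, not the paper's code.

```python
from statistics import fmean

def pearson(xs, ys):
    # Pearson correlation between critic value estimates and Monte Carlo
    # returns on fresh rollouts. Tracked across training steps, a stable
    # value near 1.0 would support the calibration-under-shift claim.
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

critic_estimates = [0.2, 0.5, 0.9, 0.4]   # hypothetical critic outputs
mc_returns = [0.0, 1.0, 1.0, 0.0]          # binary verifier outcomes
print(round(pearson(critic_estimates, mc_returns), 3))
```

Note that with binary outcomes this measures discrimination more than calibration proper; a reliability diagram or expected calibration error would be the stricter check.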
Referee: [Abstract / §5] Scaling Experiments and Downstream Results: The reported scaling experiments and RL performance gains are described at a high level, without details on architectures, training procedures, statistical tests, ablation controls, or the exact metrics used for value approximation and ranking. This makes it difficult to evaluate the reliability of the claim that standard critics do not improve with scale, or that the GenAC gains are robust.
Authors: We agree that the original presentation of the scaling experiments and RL results in the abstract and §5 was insufficiently detailed for rigorous evaluation. The revised manuscript now includes expanded descriptions in §5: full model architectures and hyperparameters for both actor and critic, training procedures for the RL loop, statistical tests (including paired t-tests with reported p-values and confidence intervals), ablation studies isolating the contributions of chain-of-thought reasoning and in-context conditioning, and precise metrics (MSE for value approximation, Kendall tau for ranking reliability, and normalized returns for downstream performance). These additions directly address the concerns and allow readers to assess the robustness of the finding that standard critics scale unreliably while GenAC gains persist. revision: yes
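Of the metrics listed, Kendall tau for ranking reliability is the least standard in RL reporting; a minimal ties-free sketch of what it measures here:

```python
def kendall_tau(predicted, actual):
    # Rank correlation between critic scores and ground-truth values over
    # a set of candidate continuations: +1 means the critic orders every
    # pair correctly, -1 means every pair is inverted. Assumes no ties
    # (tau-b would be needed otherwise).
    n = len(predicted)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([0.9, 0.4, 0.7], [1.0, 0.0, 0.5]))  # → 1.0
```

Ranking reliability is the quantity that matters for policy improvement: advantages only need to order candidate actions correctly, not match returns exactly.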
Circularity Check
No circularity: empirical proposal with independent experimental validation
full rationale
The paper frames GenAC as an empirical method motivated by cited representation complexity theory (one-shot scalar prediction hardness) and validated through scaling experiments plus downstream RL benchmarks. No load-bearing step reduces by construction to fitted inputs, self-citations, or renamed known results; value estimates, ranking metrics, and RL returns are measured directly against baselines rather than derived tautologically. The in-context conditioning mechanism is introduced as a practical alignment technique without any uniqueness theorem or ansatz smuggled via prior self-work. This is a standard self-contained empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Value functions are difficult to approximate under the one-shot prediction paradigm used by existing value models, per representation complexity theory.
invented entities (2)
- Generative Critic: no independent evidence
- In-Context Conditioning: no independent evidence
Reference graph
Works this paper leans on
- [1] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [2] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [3] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
- [4] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [5] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
- [6] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [7] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022.
- [8] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
- [9] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in Neural Information Processing Systems, 12, 1999.
- [10] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [11] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
- [12] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024.
- [13] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. Advances in Neural Information Processing Systems, 20, 2007.
- [14] Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What's behind PPO's collapse in long-CoT? Value optimization holds the secret. arXiv preprint arXiv:2503.01491, 2025.
- [15] Dingwei Zhu, Shihan Dou, Zhiheng Xi, Senjie Jin, Guoqiang Zhang, Jiazheng Zhang, Junjie Ye, Mingxu Chai, Enyu Zhou, Ming Zhang, et al. VRPO: Rethinking value modeling for robust RL training under noisy supervision. arXiv preprint arXiv:2508.03058, 2025.
- [16] Xue Gong, Qi Yi, Ziyuan Nan, Guanhua Huang, Kejiao Li, Yuhao Jiang, Ruibin Xiong, Zenan Xu, Jiaming Guo, Shaohui Peng, et al. Segmental advantage estimation: Enhancing PPO for long-context LLM training. arXiv preprint arXiv:2601.07320, 2026.
- [17] Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025.
- [18] Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu, Yufeng Yuan, et al. Truncated proximal policy optimization. arXiv preprint arXiv:2506.15050, 2025.
- [19] Jiashun Liu, Johan Obando-Ceron, Han Lu, Yancheng He, Weixun Wang, Wenbo Su, Bo Zheng, Pablo Samuel Castro, Aaron Courville, and Ling Pan. Asymmetric proximal policy optimization: Mini-critics boost LLM reasoning. arXiv preprint arXiv:2510.01656, 2025.
- [20] Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Refining credit assignment in RL training of LLMs. In International Conference on Machine Learning, pages 29557–29590. PMLR, 2025.
- [21] Yiran Guo, Lijie Xu, Jie Liu, Ye Dan, and Shuang Qiu. Segment policy optimization: Effective segment-level credit assignment in RL for large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [22] Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [23] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- [24] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024.
- [25] Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, et al. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592, 2024.
- [26] Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, and Liwei Wang. DPO meets PPO: Reinforced token optimization for RLHF. In International Conference on Machine Learning, pages 78498–78521. PMLR, 2025.
- [27] Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, et al. BRiTE: Bootstrapping reinforced thinking process to enhance language model reasoning. arXiv preprint arXiv:2501.18858, 2025.
- [28] Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. In International Conference on Machine Learning, pages 73511–73525. PMLR, 2025.
- [29] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
- [30] Yuxiao Qu, Matthew YR Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement finetuning. In International Conference on Machine Learning, pages 50893–50925. PMLR, 2025.
- [31] Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. In The Thirteenth International Conference on Learning Representations, 2024.
- [32] Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, and Alon Albalak. Generative reward models. arXiv preprint arXiv:2410.12832, 2024.
- [33] Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025.
- [34] Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. RM-R1: Reward modeling as reasoning. arXiv preprint arXiv:2505.02387, 2025.
- [35] Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. Reward reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [36] Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, et al. GenPRM: Scaling test-time compute of process reward models via generative reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34932–34940, 2026.
- [37] Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think. arXiv preprint arXiv:2504.16828, 2025.
- [38] Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S Boning, and Dina Katabi. RL Tango: Reinforcing generator and verifier together for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [39] Guhao Feng and Han Zhong. Rethinking model-based, policy-based, and value-based reinforcement learning via the lens of representation complexity. Advances in Neural Information Processing Systems, 37:8569–8611, 2024.
- [40] William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023.
- [41] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [42] Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. DeepScaleR: Surpassing o1-preview with a 1.5B model by scaling RL. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca30..., 2025.
- [43] Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. Advances in Neural Information Processing Systems, 36:70757–70798, 2023.
- [44] Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. In The Twelfth International Conference on Learning Representations, 2024.
- [45] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
- [46] Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of LLMs should leverage suboptimal, on-policy data. In International Conference on Machine Learning, pages 47441–47474. PMLR, 2024.
- [47] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. In International Conference on Machine Learning, pages 10818–10838. PMLR, 2025.
- [48] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
- [49] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.
- [50] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024.
- [51] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
- [52] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.
- [53] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.
- [54] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- [55] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. PyTorch FSDP: Experiences on scaling fully sharded data parallel. Proceedings of the VLDB Endowment, 16(12):3848–3860, 2023.