Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation
Pith reviewed 2026-05-09 19:30 UTC · model grok-4.3
The pith
Pass-rate rewards do not reliably improve final performance over binary rewards in critic-free RL for code generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Despite reducing reward sparsity, pass-rate rewards do not reliably improve final performance over binary rewards in critic-free RL for code generation. Analysis shows that while pass-rate rewards are denser, they fail to consistently move probability mass toward full-pass solutions because the test-case pass rate miscalibrates progress and partial-pass solutions induce conflicting gradient directions that cancel out.
What carries the argument
Gradient direction analysis of pass-rate rewards, which reveals cancelling effects from partial-pass solutions within sampling groups.
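A toy group-relative advantage computation illustrates the effect (a minimal sketch; the group size and pass rates below are invented for illustration, not taken from the paper):

```python
# Toy GRPO/RLOO-style group: each sampled solution's fraction of unit tests passed.
pass_rates = [1.0, 0.6, 0.4, 0.0]  # one full pass, two partial passes, one failure

def group_advantages(rewards):
    """Advantage of each sample relative to the group-mean baseline."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Binary reward: 1 only when all tests pass.
binary = [1.0 if p == 1.0 else 0.0 for p in pass_rates]
print(group_advantages(binary))      # [0.75, -0.25, -0.25, -0.25]
# Pass-rate reward: the surrogate under study.
print(group_advantages(pass_rates))  # [0.5, ~0.1, ~-0.1, -0.5]: the 60%-pass
# sample gets a positive advantage, so probability mass is also pushed toward
# a solution that never passes all tests -- the conflicting direction at issue.
```

Under the binary reward only the full-pass sample is reinforced; under the pass-rate reward a partial-pass sample is reinforced too, which is the cancellation mechanism the analysis identifies.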
If this is right
- Pass-rate rewards remain insufficient for improving code generation in critic-free RL setups.
- Binary rewards can be as effective as denser alternatives when gradients conflict.
- Reward functions must align optimization directions with the objective of full correctness.
- Conflicting updates from miscalibrated surrogates limit learning in grouped sampling RL methods.
Where Pith is reading between the lines
- Similar gradient cancellation could occur in other RL domains using partial credit rewards, like theorem proving.
- Reward designs incorporating penalties for inconsistent partial solutions might mitigate the issue.
- Applying the same analysis to actor-critic methods could show if critics help resolve conflicts.
Load-bearing premise
That the results from the specific set of base models, algorithms, and controlled setups generalize to other cases, and that the gradient analysis captures the primary reason for the lack of improvement.
What would settle it
Finding a consistent performance advantage for pass-rate rewards over binary rewards in repeated rigorous experiments with different models or algorithms would falsify the main conclusion.
Original abstract
Reinforcement learning (RL) from unit-test feedback has become a standard post-training recipe for improving large language models (LLMs) on code generation. However, the pass-all-tests binary reward can be sparse, yielding no learning signal on challenging problems where none of the sampled solutions passes all tests. A common remedy is to use the test-case pass rate as a surrogate reward. In this work, we study pass-rate rewards in critic-free RL for code generation (e.g., GRPO and RLOO) and report a consistent pattern across base models and algorithms: despite alleviating reward sparsity, pass-rate rewards do not reliably improve final performance over binary rewards in rigorous controlled experiments. To understand this discrepancy, we analyze reward density and the resulting gradient directions. We find that pass-rate rewards are denser, but the induced gradient updates do not consistently move probability mass toward full-pass solutions. This arises because test-case pass rate is a miscalibrated surrogate for progress toward full correctness, and partial-pass solutions within the same group can induce conflicting gradient directions that cancel out. Overall, our results suggest that, in critic-free RL, pass-rate rewards are insufficient to improve code generation and motivate reward designs that better align optimization with the goal of full correctness.
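The density claim in the abstract can be made concrete with a small simulation (the sampling distribution below is hypothetical, chosen only to model a hard problem where no sample passes every test):

```python
import random

random.seed(0)

def sample_group(n=8):
    """Hypothetical pass rates for n sampled solutions to a hard problem:
    every sample passes only a fraction of the tests, never all of them."""
    return [random.choice([0.0, 0.2, 0.4, 0.6]) for _ in range(n)]

def gives_signal(rewards):
    """A critic-free group yields a gradient signal only if rewards differ
    within the group; identical rewards mean zero advantages everywhere."""
    return max(rewards) != min(rewards)

groups = [sample_group() for _ in range(1000)]
binary_dense = sum(gives_signal([1.0 if p == 1.0 else 0.0 for p in g]) for g in groups)
rate_dense = sum(gives_signal(g) for g in groups)
print(binary_dense, rate_dense)  # binary: 0 of 1000 groups carry signal;
                                 # pass-rate: nearly all do -- denser, as the
                                 # abstract says, but density alone says
                                 # nothing about gradient direction.
```

This reproduces the sparsity premise (all-fail groups are silent under the binary reward) without bearing on the paper's separate point that the denser signal can still point the wrong way.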
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines pass-rate rewards as a denser alternative to binary (pass-all-tests) rewards in critic-free RL algorithms such as GRPO and RLOO for post-training LLMs on code generation. It reports that, across multiple base models and algorithms in controlled experiments, pass-rate rewards fail to improve final performance over binary rewards. The authors attribute this to the pass-rate being a miscalibrated surrogate for full correctness, which produces denser signals but induces conflicting gradient directions among partial-pass solutions within the same group, preventing consistent movement of probability mass toward fully correct solutions.
Significance. If the empirical pattern and mechanistic analysis hold, the result is significant for the field of RL-based code generation. It challenges the common assumption that denser surrogate rewards like test-case pass rates will reliably aid optimization in sparse-reward settings, and it identifies a concrete limitation arising from group-relative baselines. The work motivates reward designs that better align with the binary goal of full correctness rather than partial progress.
major comments (2)
- [§4] §4 (gradient-direction analysis): The explanation that 'partial-pass solutions within the same group can induce conflicting gradient directions that cancel out' is not isolated from the group-relative baseline subtraction used in GRPO and RLOO. The paper shows increased reward density but does not report a controlled ablation that holds group composition fixed while varying only the reward function (binary vs. pass-rate) or measures the policy-gradient inner product with the direction toward full-pass solutions. Without this, the lack of performance improvement could stem from baseline normalization rather than miscalibration per se.
- [§3, §5] Experimental sections (§3 and §5): The central claim of a 'consistent pattern across base models and algorithms' and 'rigorous controlled experiments' requires explicit reporting of effect sizes, variance across runs, statistical tests, and hyperparameter sensitivity analyses. The current description does not quantify how often pass-rate underperforms binary or whether the result is robust to changes in group size, learning rate, or sampling temperature.
minor comments (2)
- [Abstract, §2] The abstract and introduction use 'rigorous controlled experiments' without defining the controls (e.g., matched compute, identical sampling budgets, or fixed random seeds). Clarify this in the methods.
- [§2] Notation for advantage estimation in GRPO/RLOO should be made explicit when contrasting binary and pass-rate rewards, to help readers follow the conflicting-gradient argument.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which have helped us clarify the scope of our analysis and strengthen the empirical reporting. We address each major comment below and have revised the manuscript accordingly to improve rigor without altering the core findings.
Point-by-point responses
Referee: [§4] §4 (gradient-direction analysis): The explanation that 'partial-pass solutions within the same group can induce conflicting gradient directions that cancel out' is not isolated from the group-relative baseline subtraction used in GRPO and RLOO. The paper shows increased reward density but does not report a controlled ablation that holds group composition fixed while varying only the reward function (binary vs. pass-rate) or measures the policy-gradient inner product with the direction toward full-pass solutions. Without this, the lack of performance improvement could stem from baseline normalization rather than miscalibration per se.
Authors: We agree that explicitly isolating the reward function from the baseline mechanism would strengthen the mechanistic claim. Our analysis is performed within the standard critic-free group-relative setting (GRPO/RLOO), where the baseline is the group mean; the conflicting directions we identify are a direct consequence of pass-rate assigning heterogeneous values to partial solutions inside that group, whereas binary rewards assign uniform values. This interaction is inherent to the algorithms studied. To better isolate the effect, we have added in the revised §4 an explicit computation of the inner product (cosine similarity) between the policy gradient vector and the direction implied by the binary reward (as a proxy for movement toward full correctness). This metric is reported for both reward types under identical group compositions and shows systematically lower alignment for pass-rate rewards. We have also clarified the text to note that while a non-group baseline would be an interesting extension, it lies outside the critic-free paradigm under investigation. These changes are marked as partial revisions.
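For a fixed group, the alignment metric described in this response could be proxied by the cosine similarity between the two advantage vectors, since advantages are the per-sample weights on the score function grad log pi (a crude stand-in for the paper's actual gradient inner product; all numbers are illustrative):

```python
import math

def advantages(rewards):
    """Group-relative advantages against the group-mean baseline."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# One fixed group of sampled solutions, described by test-case pass rates.
pass_rates = [1.0, 0.6, 0.4, 0.0]
binary_adv = advantages([1.0 if p == 1.0 else 0.0 for p in pass_rates])
rate_adv = advantages(pass_rates)

# Cosine < 1 means the pass-rate reward reweights the same samples in a
# direction that only partially agrees with the push toward full-pass code.
print(round(cosine(binary_adv, rate_adv), 3))  # ~0.8 for this group
```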
Referee: [§3, §5] Experimental sections (§3 and §5): The central claim of a 'consistent pattern across base models and algorithms' and 'rigorous controlled experiments' requires explicit reporting of effect sizes, variance across runs, statistical tests, and hyperparameter sensitivity analyses. The current description does not quantify how often pass-rate underperforms binary or whether the result is robust to changes in group size, learning rate, or sampling temperature.
Authors: The referee correctly notes that the original text could be more quantitative. We have revised §3 and §5 (and added an appendix) to include: (i) mean pass@1 scores with standard deviations across three independent random seeds per configuration; (ii) effect sizes as the signed difference in final performance between pass-rate and binary rewards; (iii) Wilcoxon signed-rank tests across problems to assess whether differences are statistically significant; and (iv) sensitivity tables varying group size (4–32), learning rate, and sampling temperature. These additions confirm that pass-rate rewards show no statistically significant improvement and underperform binary rewards in the majority of settings, with the pattern robust to the tested hyperparameter ranges. The revised manuscript now quantifies the consistency claim as requested.
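The Wilcoxon procedure mentioned above can be sketched on toy paired per-problem scores (all numbers invented; ties in |difference| are assumed away for simplicity, so no average-rank handling is needed):

```python
# Hypothetical paired pass@1 scores per problem under the two reward schemes.
pass_rate_scores = [0.41, 0.36, 0.53, 0.30, 0.47]
binary_scores    = [0.43, 0.40, 0.50, 0.36, 0.48]

def wilcoxon_w(xs, ys):
    """Wilcoxon signed-rank statistic: drop zero differences, rank the rest
    by |difference|, and return the smaller of the two signed rank sums."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    ranked = sorted(diffs, key=abs)
    w_pos = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    w_neg = sum(rank for rank, d in enumerate(ranked, start=1) if d < 0)
    return min(w_pos, w_neg)

# A small statistic relative to n(n+1)/2 = 15 indicates a systematic
# difference between the paired conditions; a critical-value table or a
# normal approximation would convert it to a p-value.
print(wilcoxon_w(pass_rate_scores, binary_scores))
```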
Circularity Check
No circularity; the empirical comparisons and gradient analysis do not reduce to their own inputs.
Full rationale
The paper's central claims rest on controlled experiments comparing binary and pass-rate rewards across base models and algorithms (GRPO, RLOO), plus direct measurement of reward density and resulting policy gradients. No derivation reduces to its own inputs by construction, no parameters are fitted then renamed as predictions, and no self-citation chain or uniqueness theorem is invoked to force the conclusion. The gradient-conflict explanation follows from the group-relative advantage formulation and observed pass-rate variance within groups, which is externally verifiable rather than tautological. This is a standard empirical study whose results can be falsified by replication.
Axiom & Free-Parameter Ledger
axioms (1)
- (standard math) Policy-gradient estimators in GRPO and RLOO correctly reflect the direction of updates induced by the chosen reward.