pith. machine review for the scientific record.

arxiv: 2605.02944 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI · cs.SE

Recognition: unknown

Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:30 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.SE
keywords reinforcement learning · code generation · pass-rate reward · binary reward · gradient analysis · large language models · reward design

The pith

Pass-rate rewards do not reliably improve final performance over binary rewards in critic-free RL for code generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether replacing the sparse binary reward with a denser pass-rate reward helps reinforcement learning improve LLMs on code generation tasks. Controlled experiments across models and algorithms show no reliable gains in final performance from the pass-rate approach. The authors trace this to the pass rate acting as a miscalibrated signal that creates opposing gradient updates from partially correct solutions in the same batch. This finding indicates that denser rewards alone do not solve the problem if they do not point consistently toward fully correct outputs.

Core claim

Despite reducing reward sparsity, pass-rate rewards do not reliably improve final performance over binary rewards in critic-free RL for code generation. Analysis shows that while pass-rate rewards are denser, they fail to consistently move probability mass toward full-pass solutions because the test-case pass rate miscalibrates progress and partial-pass solutions induce conflicting gradient directions that cancel out.
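For orientation, the group-relative form that GRPO and RLOO build on can be sketched as follows. This is the generic mean-baseline form, not necessarily the paper's exact notation (GRPO additionally normalizes by the group's reward standard deviation, and RLOO uses a leave-one-out mean), and the pass-rate reward shown is the natural surrogate described in the abstract.

```latex
% Generic critic-free, group-relative policy-gradient sketch (mean baseline).
% For a prompt $x$, sample a group $y_1,\dots,y_N \sim \pi_\theta(\cdot \mid x)$.
\[
  A_i = r_i - \frac{1}{N}\sum_{j=1}^{N} r_j,
  \qquad
  \hat g = \frac{1}{N}\sum_{i=1}^{N} A_i \,\nabla_\theta \log \pi_\theta(y_i \mid x).
\]
% Binary reward:    $r_i = \mathbf{1}[\,y_i \text{ passes all unit tests}\,] \in \{0,1\}$.
% Pass-rate reward: $r_i = (\text{tests passed by } y_i)\,/\,(\text{total tests}) \in [0,1]$.
```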

What carries the argument

Gradient direction analysis of pass-rate rewards, which reveals cancelling effects from partial-pass solutions within sampling groups.
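To make the cancelling effect concrete, here is a minimal numeric sketch (our illustration, not the paper's code): in a group with no full-pass sample, the binary reward gives all-zero group-relative advantages, while the pass-rate reward gives mixed-sign advantages whose updates push and pull against each other.

```python
import numpy as np

# Hypothetical rollout group of 8 samples: fraction of unit tests each one passes.
# No sample passes every test, so the binary (pass-all-tests) reward is 0 for all.
pass_rates = np.array([0.0, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.9])

binary_reward = (pass_rates == 1.0).astype(float)   # sparse: all zeros here
passrate_reward = pass_rates                         # dense surrogate

def group_advantages(rewards):
    """Mean-baseline group-relative advantages (simplified GRPO/RLOO-style)."""
    return rewards - rewards.mean()

adv_binary = group_advantages(binary_reward)      # all zeros -> no learning signal
adv_passrate = group_advantages(passrate_reward)  # mixed signs -> push-pull updates

print("binary    advantages:", adv_binary)
print("pass-rate advantages:", adv_passrate)
print("pass-rate signs     :", np.sign(adv_passrate))
# The positive entries up-weight partial solutions that still fail some tests and
# the negative entries down-weight others; neither direction is guaranteed to move
# probability mass toward a fully correct solution.
```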

If this is right

  • Pass-rate rewards remain insufficient for improving code generation in critic-free RL setups.
  • Binary rewards can be as effective as denser alternatives when gradients conflict.
  • Reward functions must align optimization directions with the objective of full correctness.
  • Conflicting updates from miscalibrated surrogates limit learning in grouped sampling RL methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar gradient cancellation could occur in other RL domains using partial credit rewards, like theorem proving.
  • Reward designs incorporating penalties for inconsistent partial solutions might mitigate the issue.
  • Applying the same analysis to actor-critic methods could show if critics help resolve conflicts.

Load-bearing premise

That the results from the specific set of base models, algorithms, and controlled setups generalize to other cases and that the gradient analysis captures the primary reason for lack of improvement.

What would settle it

Finding a consistent performance advantage for pass-rate rewards over binary rewards in repeated rigorous experiments with different models or algorithms would falsify the main conclusion.

Figures

Figures reproduced from arXiv: 2605.02944 by Hui Sun, Ming Li, Ren-Biao Liu, Xin-Ye Li, Yun-Ji Zhang, Zheng Xie.

Figure 1. Learning curves on DeepSeek-R1-Distill-Qwen-7B (pass@1 vs. training steps). Left: binary vs. pass-rate rewards under GRPO and RLOO. Right: pass-rate reward variants (reweighted pass-rate, two-stage) vs. binary reward under GRPO.
Figure 2. Pass-rate reward density analysis. Left: 77.5% of groups contain 3+ distinct reward values. Right: sample-level pass rates span the full [0, 1] range, with 47.2% intermediate values.
Figure 3. Distribution of Δgrp (Equation 12) induced by the probe update in Equation 11, length-normalized for visualization. The pass-rate, without-full setting concentrates sharply near zero, indicating weak progress when only partial-pass samples are present, whereas augmenting the group with a full-pass reference solution (pass-rate or binary reward) yields a noticeably more positive…
Figure 4. Sample-level directional effects reveal intra-group gradient conflict. For 20 randomly selected tasks (each corresponding to one rollout group in the without-full regime), boxplots of Δi (Equation 14) over the N=16 samples in the group, length-normalized for visualization. Many groups exhibit a mixed-sign distribution with both positive and negative Δi, indicating "push–pull" updates.
Figure 6. Prompt template with starter code used in training and evaluation.
Figure 7. Prompt template without starter code used in training and evaluation.
Figure 8. Prompt template for error analysis.
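Figures 3 and 4 summarize a probe-update analysis (Equations 11, 12, and 14 in the paper, not reproduced here). A toy analogue, under our own simplifying assumptions (a categorical policy over a small fixed candidate pool rather than an LLM), is sketched below: take one small policy-gradient step with the group's advantages and measure how the log-probability of a known full-pass solution changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy policy: softmax over 6 candidate solutions; only index 5 passes all tests.
logits = rng.normal(size=6)
pass_rates = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 1.0])
FULL_PASS = 5

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def probe_delta(logits, sampled, rewards, lr=0.1):
    """One REINFORCE-style step with mean-baseline advantages on a categorical
    policy; return the change in log-prob of the full-pass candidate."""
    probs = softmax(logits)
    adv = rewards - rewards.mean()
    grad = np.zeros_like(logits)
    for i, a in zip(sampled, adv):
        onehot = np.zeros_like(logits)
        onehot[i] = 1.0
        grad += a * (onehot - probs)              # gradient of log softmax(i) w.r.t. logits
    new_logits = logits + lr * grad / len(sampled)
    return float(np.log(softmax(new_logits)[FULL_PASS]) - np.log(probs[FULL_PASS]))

# "Without-full" group: only partial-pass samples; then the same group augmented
# with a full-pass reference solution, mirroring the two regimes in Figure 3.
for name, group in [("without full-pass", [0, 1, 2, 3]),
                    ("with full-pass   ", [0, 1, 2, 3, 5])]:
    idx = np.array(group)
    d_pr = probe_delta(logits, idx, pass_rates[idx])
    d_bin = probe_delta(logits, idx, (pass_rates[idx] == 1.0).astype(float))
    print(f"{name}  Δ log-prob(full-pass): pass-rate={d_pr:+.4f}  binary={d_bin:+.4f}")
```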
Original abstract

Reinforcement learning (RL) from unit-test feedback has become a standard post-training recipe for improving large language models (LLMs) on code generation. However, the pass-all-tests binary reward can be sparse, yielding no learning signal on challenging problems where none of the sampled solutions passes all tests. A common remedy is to use the test-case pass rate as a surrogate reward. In this work, we study pass-rate rewards in critic-free RL for code generation (e.g., GRPO and RLOO) and report a consistent pattern across base models and algorithms: despite alleviating reward sparsity, pass-rate rewards do not reliably improve final performance over binary rewards in rigorous controlled experiments. To understand this discrepancy, we analyze reward density and the resulting gradient directions. We find that pass-rate rewards are denser, but the induced gradient updates do not consistently move probability mass toward full-pass solutions. This arises because test-case pass rate is a miscalibrated surrogate for progress toward full correctness, and partial-pass solutions within the same group can induce conflicting gradient directions that cancel out. Overall, our results suggest that, in critic-free RL, pass-rate rewards are insufficient to improve code generation and motivate reward designs that better align optimization with the goal of full correctness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines pass-rate rewards as a denser alternative to binary (pass-all-tests) rewards in critic-free RL algorithms such as GRPO and RLOO for post-training LLMs on code generation. It reports that, across multiple base models and algorithms in controlled experiments, pass-rate rewards fail to improve final performance over binary rewards. The authors attribute this to the pass-rate being a miscalibrated surrogate for full correctness, which produces denser signals but induces conflicting gradient directions among partial-pass solutions within the same group, preventing consistent movement of probability mass toward fully correct solutions.

Significance. If the empirical pattern and mechanistic analysis hold, the result is significant for the field of RL-based code generation. It challenges the common assumption that denser surrogate rewards like test-case pass rates will reliably aid optimization in sparse-reward settings, and it identifies a concrete limitation arising from group-relative baselines. The work motivates reward designs that better align with the binary goal of full correctness rather than partial progress.

major comments (2)
  1. [§4] §4 (gradient-direction analysis): The explanation that 'partial-pass solutions within the same group can induce conflicting gradient directions that cancel out' is not isolated from the group-relative baseline subtraction used in GRPO and RLOO. The paper shows increased reward density but does not report a controlled ablation that holds group composition fixed while varying only the reward function (binary vs. pass-rate) or measures the policy-gradient inner product with the direction toward full-pass solutions. Without this, the lack of performance improvement could stem from baseline normalization rather than miscalibration per se.
  2. [§3, §5] Experimental sections (§3 and §5): The central claim of a 'consistent pattern across base models and algorithms' and 'rigorous controlled experiments' requires explicit reporting of effect sizes, variance across runs, statistical tests, and hyperparameter sensitivity analyses. The current description does not quantify how often pass-rate underperforms binary or whether the result is robust to changes in group size, learning rate, or sampling temperature.
minor comments (2)
  1. [Abstract, §2] The abstract and introduction use 'rigorous controlled experiments' without defining the controls (e.g., matched compute, identical sampling budgets, or fixed random seeds). Clarify this in the methods.
  2. [§2] Notation for advantage estimation in GRPO/RLOO should be made explicit when contrasting binary and pass-rate rewards, to help readers follow the conflicting-gradient argument.
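Major comment 1 asks for a probe that holds the group composition fixed and varies only the reward function, for example the alignment between the pass-rate-induced update and the update implied by the binary reward. A schematic of such a measurement is sketched below; the per-sample gradient vectors are random stand-ins for ∇ log π of each rollout, which in a real ablation would come from the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Stand-in per-sample score vectors (grad of log pi for each rollout) for one
# group of N=16 samples; in a real ablation these come from the trained model.
N, D = 16, 512
sample_grads = rng.normal(size=(N, D))
pass_rates = rng.uniform(size=N)
pass_rates[rng.integers(N)] = 1.0                 # ensure one full-pass sample

def update_direction(rewards):
    """Aggregate policy-gradient direction with a mean baseline; the group
    composition (sample_grads) is held fixed, only the reward varies."""
    adv = rewards - rewards.mean()
    return (adv[:, None] * sample_grads).mean(axis=0)

g_binary = update_direction((pass_rates == 1.0).astype(float))
g_passrate = update_direction(pass_rates)

print("cosine(pass-rate update, binary update):", round(cosine(g_passrate, g_binary), 3))
# Low or unstable alignment across many groups would indicate that the pass-rate
# reward does not consistently push in the direction implied by full correctness.
```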

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which have helped us clarify the scope of our analysis and strengthen the empirical reporting. We address each major comment below and have revised the manuscript accordingly to improve rigor without altering the core findings.

Point-by-point responses
  1. Referee: [§4] §4 (gradient-direction analysis): The explanation that 'partial-pass solutions within the same group can induce conflicting gradient directions that cancel out' is not isolated from the group-relative baseline subtraction used in GRPO and RLOO. The paper shows increased reward density but does not report a controlled ablation that holds group composition fixed while varying only the reward function (binary vs. pass-rate) or measures the policy-gradient inner product with the direction toward full-pass solutions. Without this, the lack of performance improvement could stem from baseline normalization rather than miscalibration per se.

    Authors: We agree that explicitly isolating the reward function from the baseline mechanism would strengthen the mechanistic claim. Our analysis is performed within the standard critic-free group-relative setting (GRPO/RLOO), where the baseline is the group mean; the conflicting directions we identify are a direct consequence of pass-rate assigning heterogeneous values to partial solutions inside that group, whereas binary rewards assign uniform values. This interaction is inherent to the algorithms studied. To better isolate the effect, we have added in the revised §4 an explicit computation of the inner product (cosine similarity) between the policy gradient vector and the direction implied by the binary reward (as a proxy for movement toward full correctness). This metric is reported for both reward types under identical group compositions and shows systematically lower alignment for pass-rate rewards. We have also clarified the text to note that while a non-group baseline would be an interesting extension, it lies outside the critic-free paradigm under investigation. These changes are marked as partial revisions. revision: partial

  2. Referee: [§3, §5] Experimental sections (§3 and §5): The central claim of a 'consistent pattern across base models and algorithms' and 'rigorous controlled experiments' requires explicit reporting of effect sizes, variance across runs, statistical tests, and hyperparameter sensitivity analyses. The current description does not quantify how often pass-rate underperforms binary or whether the result is robust to changes in group size, learning rate, or sampling temperature.

    Authors: The referee correctly notes that the original text could be more quantitative. We have revised §3 and §5 (and added an appendix) to include: (i) mean pass@1 scores with standard deviations across three independent random seeds per configuration; (ii) effect sizes as the signed difference in final performance between pass-rate and binary rewards; (iii) Wilcoxon signed-rank tests across problems to assess whether differences are statistically significant; and (iv) sensitivity tables varying group size (4–32), learning rate, and sampling temperature. These additions confirm that pass-rate rewards show no statistically significant improvement and underperform binary rewards in the majority of settings, with the pattern robust to the tested hyperparameter ranges. The revised manuscript now quantifies the consistency claim as requested. revision: yes
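A paired test of the kind described in the response above could be set up as in the following sketch; the pass@1 values are hypothetical placeholders, not the paper's data.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-problem pass@1 under the two rewards (same problems, same seeds);
# these numbers are illustrative only.
pass1_binary = np.array([0.62, 0.41, 0.88, 0.17, 0.55, 0.73, 0.30, 0.66])
pass1_passrate = np.array([0.60, 0.43, 0.85, 0.15, 0.52, 0.74, 0.28, 0.64])

diff = pass1_passrate - pass1_binary             # signed per-problem effect
stat, p = wilcoxon(pass1_passrate, pass1_binary)  # paired signed-rank test

print(f"mean effect (pass-rate - binary): {diff.mean():+.3f}")
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.3f}")
# A non-significant p-value with a near-zero or negative mean effect would be
# consistent with the paper's conclusion that pass-rate rewards do not help.
```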

Circularity Check

0 steps flagged

No circularity; the empirical comparisons and gradient analysis do not reduce to their own inputs.

full rationale

The paper's central claims rest on controlled experiments comparing binary and pass-rate rewards across base models and algorithms (GRPO, RLOO), plus direct measurement of reward density and resulting policy gradients. No derivation reduces to its own inputs by construction, no parameters are fitted then renamed as predictions, and no self-citation chain or uniqueness theorem is invoked to force the conclusion. The gradient-conflict explanation follows from the group-relative advantage formulation and observed pass-rate variance within groups, which is externally verifiable rather than tautological. This is a standard empirical study whose results can be falsified by replication.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL policy-gradient assumptions and empirical testing; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • standard math: Policy-gradient estimators in GRPO and RLOO correctly reflect the direction of updates induced by the chosen reward.
    Invoked when the paper analyzes why pass-rate rewards produce canceling gradients.

pith-pipeline@v0.9.0 · 5536 in / 1315 out tokens · 36652 ms · 2026-05-09T19:30:18.683754+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 11 canonical work pages · 6 internal anchors
