arxiv: 2511.07833 · v3 · submitted 2025-11-11 · 💻 cs.LG · cs.AI

MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation

Chanakya Ekbote, Vijay Lingam, Sujay Sanghavi, Jun Huan, Behrooz Omidvar-Tehrani, Anoop Deoras, Stefano Soatto

Pith reviewed 2026-05-17 23:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords multi-turn reinforcement learning · code generation · GRPO · reward propagation · self-correction · execution feedback · LLM post-training · agentic optimization

The pith

MURPHY adapts GRPO to multi-turn code generation by building feedback trees and propagating rewards backward from successful refinements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MURPHY as a multi-turn extension of Group Relative Policy Optimization (GRPO) tailored to self-correcting code generation. Single-turn GRPO optimizes only from terminal rewards on isolated prompt-response pairs, which does not fit settings where models must respond to execution feedback over several turns. MURPHY instead builds trees of candidate solutions, attaches executor feedback to failures, and expands them into later turns. It then propagates the terminal reward back through the tree, using either max or mean aggregation, so that early attempts that surfaced useful feedback receive credit. The paper reports gains on standard code benchmarks, largest on the medium and hard problems where iterative refinement matters most.
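
For context, the group-relative signal that single-turn GRPO derives from terminal rewards can be sketched as below. This is a minimal illustration and not the paper's code: the function name, the `eps` constant, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate solutions for one prompt: two pass (1.0) and two fail (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Each response is scored only against its own group, so a failed first attempt gets a negative advantage even when its feedback would have enabled a later fix. That is the gap the multi-turn extension targets.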

Core claim

MURPHY constructs feedback-conditioned rollout trees in which failed candidate solutions are paired with executor feedback and expanded into subsequent turns, and propagates rewards backward through the tree so that later successful refinements credit earlier attempts that surfaced informative feedback. It studies two propagation strategies, Max Reward (MARS) and Mean Reward (MERS), and introduces post-rollout pruning mechanisms that reduce multi-turn optimization cost.

What carries the argument

Feedback-conditioned rollout trees with retrospective backward reward propagation via MARS or MERS strategies
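
The backward propagation can be sketched on a toy tree as follows. This is a hedged illustration: the `Node` class, `propagate`, and `toy_tree` are hypothetical names, and the rule that an expanded node takes the aggregate of its children's credit is an assumption, since the summary above specifies only that aggregation is by max (MARS) or mean (MERS).

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List


@dataclass
class Node:
    reward: float                                   # executor reward for this attempt
    children: List["Node"] = field(default_factory=list)
    credit: float = 0.0                             # filled in by propagate()


def propagate(node: Node, aggregate: Callable[[List[float]], float]) -> float:
    """Post-order pass: leaves keep their own reward, expanded nodes take the
    aggregate of their children's credit (assumed rule, see lead-in)."""
    if not node.children:
        node.credit = node.reward
    else:
        node.credit = aggregate([propagate(c, aggregate) for c in node.children])
    return node.credit


def toy_tree() -> Node:
    # Turn 1 fails. One of its two refinements fails again and is expanded
    # into a passing (1.0) and a failing (0.0) third-turn attempt.
    return Node(0.0, [Node(0.0), Node(0.0, [Node(1.0), Node(0.0)])])


mars_root = propagate(toy_tree(), max)    # max aggregation
mers_root = propagate(toy_tree(), mean)   # mean aggregation
```

In this toy tree the failed first attempt ends with credit 1.0 under max aggregation and 0.25 under mean aggregation. Max rewards any path to success, while mean discounts by how many refinements still failed.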

If this is right

Up to 6 percent absolute pass@1 improvement over the strongest prior multi-turn execution-feedback methods across three code benchmarks.
Largest gains appear on the Medium and Hard subsets, reaching +4.38 and +4.20 respectively at iteration 5 (Iter-5).
Post-rollout pruning lowers the computational cost of maintaining multi-turn trees during optimization.
The gains hold across two model families (Qwen3-1.7B/4B, OLMo-2-7B) and three benchmarks: HumanEval, MBPP, and LiveCodeBench-v6.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Single-turn RL methods may systematically undervalue intermediate feedback signals that only become useful after later corrections occur.
The tree-based credit mechanism could transfer to other iterative agent tasks such as theorem proving or multi-step planning where partial failures provide diagnostic information.
Extending the pruning rules or testing longer interaction horizons would reveal how tree depth affects credit assignment stability.

Load-bearing premise

The method assumes that code-executor feedback is sufficiently informative and consistent to support reliable backward credit assignment, and that post-rollout pruning does not discard trajectories that would have produced better policy updates.

What would settle it

An ablation that keeps the same multi-turn rollout trees but assigns the final reward only to the last turn without any backward propagation, then checks whether the reported gains over prior multi-turn baselines disappear.
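
The two arms of such an ablation can be sketched on a single root-to-leaf path. This is an illustrative sketch only: the function names are hypothetical, and max-style propagation along a chain is used as a stand-in for the paper's tree-based rules.

```python
from typing import List


def terminal_only_credit(path_rewards: List[float]) -> List[float]:
    """Control arm: every turn keeps its own reward, nothing flows backward."""
    return list(path_rewards)


def backward_max_credit(path_rewards: List[float]) -> List[float]:
    """Treatment arm: each turn is credited with the best reward reached
    at that turn or any later turn on the same path."""
    credits: List[float] = []
    best = float("-inf")
    for reward in reversed(path_rewards):
        best = max(best, reward)
        credits.append(best)
    return credits[::-1]


path = [0.0, 0.0, 1.0]           # two failed attempts, then a passing fix
control = terminal_only_credit(path)
treatment = backward_max_credit(path)
```

With identical rollouts in both arms, any difference in downstream pass@1 would be attributable to the credit rule rather than to the extra turns.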

Figures

Figures reproduced from arXiv: 2511.07833 by Anoop Deoras, Behrooz Omidvar-Tehrani, Chanakya Ekbote, Jun Huan, Stefano Soatto, Sujay Sanghavi, Vijay Lingam.

**Figure 1.** Percentage change in coding problems solved by models trained with MURPHY and GRPO over the base model across three models and datasets. MURPHY-trained models solve up to 4.2% more problems than GRPO. See Tab. 1 for details.

**Figure 2.** Overview of MURPHY. Given an input prompt (q), G code generations (o) are generated and evaluated using a reward function (r). Generations that do not achieve the maximum reward are revised based on executor feedback (f), combining the original prompt with the failed output, and re-prompted to generate another G candidates. This iterative process continues for a fixed number of turns, with rewards from lat…

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard recipe for post-training LLMs on reasoning tasks, with Group Relative Policy Optimization (GRPO) emerging as a leading approach. However, GRPO and its variants are inherently single-turn: they optimize from terminal rewards on isolated prompt-response pairs, leaving them poorly suited to agentic settings where models must iteratively refine solutions in response to environmental feedback. We introduce MURPHY, a multi-turn extension of GRPO for self-correcting code generation. MURPHY constructs feedback-conditioned rollout trees in which failed candidate solutions are paired with executor feedback and expanded into subsequent turns, and propagates rewards backward through the tree so that later successful refinements credit earlier attempts that surfaced informative feedback. We study two propagation strategies, Max Reward (MARS) and Mean Reward (MERS), and introduce post-rollout pruning mechanisms that reduce multi-turn optimization cost. Across three code generation benchmarks (HumanEval, MBPP, LiveCodeBench-v6) and two model families (Qwen3-1.7B/4B, OLMo-2-7B), MURPHY delivers up to 6% absolute pass@1 gains over the strongest prior multi-turn execution-feedback methods. Gains are largest on the Medium/Hard subset (+4.38/+4.20 at Iter-5), where iterative self-correction matters more.
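
The rollout-tree construction described in the abstract can be sketched as a short recursion. This is a minimal sketch under stated assumptions: `Attempt`, `build_tree`, the follow-up prompt template, and the `toy_generate` / `toy_execute` stubs are all hypothetical stand-ins for the policy sampler and code executor, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    prompt: str
    code: str
    reward: float
    feedback: str
    children: List["Attempt"] = field(default_factory=list)


def build_tree(
    prompt: str,
    generate: Callable[[str, int], List[str]],
    execute: Callable[[str], Tuple[float, str]],
    group_size: int,
    max_turns: int,
    max_reward: float = 1.0,
) -> List[Attempt]:
    """Sample a group of candidates; expand each failure with its feedback."""
    attempts = []
    for code in generate(prompt, group_size):
        reward, feedback = execute(code)
        node = Attempt(prompt, code, reward, feedback)
        if reward < max_reward and max_turns > 1:
            # Assumed template: original prompt + failed output + feedback.
            follow_up = (
                f"{prompt}\n\nPrevious attempt:\n{code}"
                f"\n\nExecutor feedback:\n{feedback}"
            )
            node.children = build_tree(
                follow_up, generate, execute, group_size, max_turns - 1, max_reward
            )
        attempts.append(node)
    return attempts


def toy_generate(prompt: str, n: int) -> List[str]:
    # Stub policy: only produces a fix once feedback is in the prompt.
    if "Executor feedback" in prompt:
        return ["good"] + ["bad"] * (n - 1)
    return ["bad"] * n


def toy_execute(code: str) -> Tuple[float, str]:
    # Stub executor: binary reward plus a feedback string.
    return (1.0, "") if code == "good" else (0.0, "AssertionError: test failed")


tree = build_tree("Write f(x).", toy_generate, toy_execute, group_size=2, max_turns=2)
```

Because every failure spawns a new group, the number of nodes grows roughly geometrically with the number of turns, which is why a pruning step matters for cost.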

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MURPHY adds feedback-conditioned rollout trees and backward propagation to GRPO for multi-turn code generation, with reported gains on harder problems that look plausible but rest on limited experimental detail.

The main point is that MURPHY takes GRPO, which is built for single-turn terminal rewards, and adapts it to iterative code fixing by building trees that link failed attempts to executor feedback and then push success signals backward with two simple rules, MARS and MERS, plus some pruning. That structure is the actual addition over the single-turn GRPO papers they cite.

The reported lifts, up to 6% absolute pass@1 and larger on medium/hard subsets, come from runs on HumanEval, MBPP, and LiveCodeBench with Qwen and OLMo models, and they beat the strongest prior multi-turn baselines they compare against. The pruning step is a sensible engineering choice to keep the multi-turn cost manageable. The framing is direct: single-turn methods ignore the value of earlier turns that surface useful feedback, and the tree plus propagation tries to fix that without adding many new knobs.

The soft spots are mostly in the evidence. The abstract gives headline numbers but no statistical tests, no ablation that isolates the credit-assignment piece from simply doing more iterations, and no full baseline implementation details. The concern about executor feedback being consistent enough for reliable backward assignment is reasonable; if tests are noisy or only fail later, the advantage estimates could get mis-attributed. Without seeing those controls in the full experiments, it's hard to know how much of the gain is really from the new propagation versus other factors.

This is useful reading for anyone working on RLVR for agentic code tasks or iterative refinement. A reader who wants concrete ideas for multi-turn policy gradients would get something out of the algorithmic section even if the numbers need verification. I would send it to peer review so the experimental claims can be checked properly.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces MURPHY, a multi-turn extension of Group Relative Policy Optimization (GRPO) for self-correcting code generation. It constructs feedback-conditioned rollout trees in which failed candidate solutions are paired with executor feedback and expanded, then propagates terminal rewards backward using two strategies: Max Reward (MARS) and Mean Reward (MERS). Post-rollout pruning is added to control optimization cost. Across HumanEval, MBPP, and LiveCodeBench-v6 with Qwen3-1.7B/4B and OLMo-2-7B models, the method is reported to deliver up to 6% absolute pass@1 gains over prior multi-turn execution-feedback baselines, with the largest improvements on medium/hard subsets (+4.38/+4.20 at iteration 5).

Significance. If the empirical results hold under rigorous scrutiny, MURPHY provides a concrete algorithmic advance for applying RLVR-style methods to agentic, multi-turn settings where single-turn GRPO is insufficient. The explicit construction of feedback-conditioned trees and the MARS/MERS propagation rules constitute a reproducible contribution that directly targets retrospective credit assignment; the pruning mechanism is a practical addition for cost control. The reported gains on harder problem subsets suggest the approach may be particularly useful where iterative self-correction is required.

major comments (3)

[Abstract and Section 4 (Experiments)] The abstract states concrete gains of up to 6% absolute pass@1 and specific subset improvements (+4.38/+4.20 at Iter-5), yet supplies no statistical tests, error bars, exact baseline implementation details, or ablation studies isolating the retrospective credit-assignment component from simply increasing the number of turns or enriching prompts. This information is load-bearing for the central claim that the new mechanism drives the observed improvements.
[Section 3.2 (MARS and MERS propagation)] The backward credit assignment relies on the assumption that executor pass/fail feedback is sufficiently consistent and informative to distinguish useful prior turns. No analysis or ablation addresses noisy, intermittent, or partial feedback (e.g., tests that fail only on later turns), which could produce mis-attributed advantages and weaken the policy gradient updates.
[Section 3.1 (Feedback-conditioned rollout trees)] The post-rollout pruning mechanism is introduced to reduce cost, but the manuscript does not quantify how often pruning discards trajectories that would have yielded superior policy updates, nor does it provide sensitivity analysis on the pruning threshold.

minor comments (3)

[Section 3.2] The notation for MARS versus MERS could be clarified with a single compact equation or pseudocode block showing the exact reward aggregation rule.
[Tables in Section 4] Table captions should explicitly state the number of random seeds and whether results are averaged or best-of-N.
[Section 4] A short paragraph comparing wall-clock or token cost of MURPHY versus the strongest baseline would help readers assess the practical trade-off introduced by the tree construction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below, indicating where revisions will be made to the manuscript.

Referee: [Abstract and Section 4] The abstract states concrete gains of up to 6% absolute pass@1 and specific subset improvements (+4.38/+4.20 at Iter-5), yet supplies no statistical tests, error bars, exact baseline implementation details, or ablation studies isolating the retrospective credit-assignment component from simply increasing the number of turns or enriching prompts. This information is load-bearing for the central claim that the new mechanism drives the observed improvements.

Authors: We agree that statistical rigor and isolating ablations are necessary to support the central claims. In the revised version we will add error bars computed over multiple random seeds, include statistical significance tests (paired t-tests) for the reported pass@1 gains, expand the baseline implementation details for full reproducibility, and insert a dedicated ablation that holds the number of turns and prompt content fixed while varying only the presence of MARS/MERS retrospective propagation. These changes will directly test whether the observed improvements are attributable to the credit-assignment mechanism rather than simply more turns or richer prompts.
revision: yes
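
The paired t-test mentioned in this response can be sketched with the standard library alone. This is an illustrative sketch: the function name is hypothetical and the inputs are toy integers chosen for easy checking, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for matched samples, e.g. two methods scored on the same seeds."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# Toy placeholder scores for two methods on four shared seeds (not real data).
t_value = paired_t_statistic([3, 4, 5, 6], [1, 3, 2, 5])
```

The statistic has `len(a) - 1` degrees of freedom. Pairing by seed removes between-seed variance that an unpaired test would count against the comparison.
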
Referee: [Section 3.2] The backward credit assignment relies on the assumption that executor pass/fail feedback is sufficiently consistent and informative to distinguish useful prior turns. No analysis or ablation addresses noisy, intermittent, or partial feedback (e.g., tests that fail only on later turns), which could produce mis-attributed advantages and weaken the policy gradient updates.

Authors: We acknowledge that the current presentation assumes reliable executor feedback. We will revise Section 3.2 to explicitly discuss this assumption and add a new ablation that injects controlled noise (random flips of pass/fail labels at varying rates) into the feedback signals. Results for both MARS and MERS under noisy conditions will be reported, together with an analysis of any resulting degradation in policy updates. If sensitivity is observed, we will also outline a lightweight mitigation such as thresholded or confidence-weighted propagation.
revision: yes
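
The controlled-noise ablation proposed in this response amounts to flipping pass/fail labels at a fixed rate before credit is assigned. A minimal sketch, assuming binary labels and a hypothetical helper name `flip_labels`:

```python
import random
from typing import List


def flip_labels(passed: List[bool], flip_rate: float, seed: int = 0) -> List[bool]:
    """Invert each pass/fail label independently with probability `flip_rate`."""
    if not 0.0 <= flip_rate <= 1.0:
        raise ValueError("flip_rate must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so a noise level is reproducible
    return [(not p) if rng.random() < flip_rate else p for p in passed]


clean = [True, False, False, True, False]
untouched = flip_labels(clean, flip_rate=0.0)
inverted = flip_labels(clean, flip_rate=1.0)
noisy = flip_labels(clean, flip_rate=0.2, seed=7)
```

Sweeping `flip_rate` and re-running propagation on the noisy labels would show how quickly mis-attributed credit builds up under each rule.
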
Referee: [Section 3.1] The post-rollout pruning mechanism is introduced to reduce cost, but the manuscript does not quantify how often pruning discards trajectories that would have yielded superior policy updates, nor does it provide sensitivity analysis on the pruning threshold.

Authors: We agree that a quantitative assessment of pruning's side effects is missing. In the revision we will report the fraction of trajectories pruned at each iteration, compare the terminal rewards of pruned versus retained trajectories to estimate potential loss in update quality, and present a sensitivity study across a range of pruning thresholds, showing the resulting pass@1 versus compute trade-off on all three benchmarks. This will clarify the practical impact of the pruning rule.
revision: yes
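
The promised sensitivity study could be organized as in the sketch below. The pruning rule shown is purely hypothetical, because the text above does not state the paper's actual criterion: here a trajectory is dropped when its credit lies within `threshold` of the group mean. The function names are likewise assumptions.

```python
from statistics import mean
from typing import Dict, List


def prune(credits: List[float], threshold: float) -> List[float]:
    """Hypothetical rule: keep only trajectories whose credit differs from
    the group mean by at least `threshold`."""
    mu = mean(credits)
    return [c for c in credits if abs(c - mu) >= threshold]


def pruned_fraction_by_threshold(
    credits: List[float], thresholds: List[float]
) -> Dict[float, float]:
    """Fraction of trajectories discarded at each candidate threshold."""
    return {t: 1.0 - len(prune(credits, t)) / len(credits) for t in thresholds}


credits = [0.0, 0.0, 1.0, 1.0, 0.5]   # toy propagated credits for one group
sweep = pruned_fraction_by_threshold(credits, [0.0, 0.25, 0.6])
```

Plotting the pruned fraction against downstream pass@1 for each threshold would give the compute-versus-quality curve the referee asks for.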

Circularity Check

0 steps flagged

No significant circularity; MURPHY is an explicit algorithmic construction

The paper presents MURPHY as a direct algorithmic extension of GRPO: it defines feedback-conditioned rollout trees, introduces MARS and MERS as explicit backward propagation rules, and adds post-rollout pruning. These components are constructed by definition rather than derived as predictions that reduce to fitted inputs or prior self-citations. No equations or claims reduce a result to its own inputs by construction, and the reported pass@1 gains are empirical outcomes from applying the defined procedure on benchmarks. The derivation chain remains self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach rests on standard RL assumptions plus two newly introduced algorithmic objects whose independent empirical support is not provided in the abstract.

axioms (1)

Domain assumption: code executor feedback is reliable and carries sufficient information for credit assignment.
Invoked when constructing feedback-conditioned rollout trees and when propagating rewards backward.

invented entities (2)

Feedback-conditioned rollout tree (no independent evidence)
purpose: Represent multi-turn attempt sequences linked by executor feedback for retrospective credit assignment
Core new data structure introduced to move beyond single-turn GRPO.
MARS and MERS reward propagation strategies (no independent evidence)
purpose: Propagate terminal rewards backward through the tree using max or mean
Two concrete mechanisms for retrospective credit assignment.

pith-pipeline@v0.9.0 · 5580 in / 1274 out tokens · 48521 ms · 2026-05-17T23:09:46.922566+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Passage: "MURPHY constructs feedback-conditioned rollout trees... propagates rewards backward... using Max Reward (MARS) and Mean Reward (MERS)"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Passage: "We study two propagation strategies, Max Reward (MARS) and Mean Reward (MERS)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 15 internal anchors

[3]

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. 2024. https://doi.org/10.18653/v1/2024.acl-long.662 Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Co...

work page doi:10.18653/v1/2024.acl-long.662 2024
[4]

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235--256

work page 2002
[5]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. https://arxiv.org/abs/2108.07732 Program synthesis with large language models . Preprint, arXiv:2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, and 39 others. 2021. https://arxiv.org/abs/2107.03374 Evaluating large lang...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Linzhuang Sun, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. 2025. https://arxiv.org/abs/2503.19470 Research: Learning to reason with search for llms via reinforcement learning . Preprint, arXiv:2503.19470

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and 181 others. 2025. https://arxiv.org/abs/2501.12948 Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement lea...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Synnaeve. 2025. https://openreview.net/forum?id=PzSG5nKe1q RLEF: Grounding code LLMs in execution feedback with reinforcement learning. In Forty-second International Conference on Machine Learning

work page 2025
[10]

Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, and Sanjiban Choudhury. 2025. https://openreview.net/forum?id=aJeLhLcsh0 Multi-turn code generation through single-step rewards . In Forty-second International Conference on Machine Learning

work page 2025
[11]

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2025. https://doi.org/10.1145/3747588 A survey on large language models for code generation . ACM Trans. Softw. Eng. Methodol

work page doi:10.1145/3747588 2025
[12]

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. https://openreview.net/forum?id=Rwhi91ideu Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling

work page 2025
[13]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. https://arxiv.org/abs/2309.06180 Efficient memory management for large language model serving with pagedattention . Preprint, arXiv:2309.06180

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Vijay Lingam, Behrooz Omidvar Tehrani, Sujay Sanghavi, Gaurav Gupta, Sayan Ghosh, Linbo Liu, Jun Huan, and Anoop Deoras. 2025. https://openreview.net/forum?id=ZsP3YbYeE9 Enhancing language model agents using diversity of thoughts . In The Thirteenth International Conference on Learning Representations

work page 2025
[15]

Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. 2025. https://openreview.net/forum?id=xZXhFg43EI SWE-lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering? In Forty-second International Conference on Machine Learning

work page 2025
[16]

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, and 21 others. 2025. https://arxiv.org/abs/2501.00656 2 olmo 2 furious . Preprint,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, and 244 others. 2024. https://arxiv.org/abs/2412.16720 Openai o1 system card . Preprint, ar...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. https://arxiv.org/abs/1707.06347 Proximal policy optimization algorithms . Preprint, arXiv:1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[19]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. 2024. https://arxiv.org/abs/2402.03300 Deepseekmath: Pushing the limits of mathematical reasoning in open language models

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/1b44b878bb782e6954cd888628510e90-Paper-Conference.pdf Reflexion: language agents with verbal reinforcement learning . In Advances in Neural Information Processing Systems, volume 36, pages 8634--8652. Curran A...

work page 2023
[21]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, and 77 others. 2025. https://arxiv.org/abs/2501.12599 Kimi k1.5: Scaling reinforcement learning with llms . Prep...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. 2020. TRL: Transformer Reinforcement Learning. https://github.com/huggingface/trl

work page 2020
[23]

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. https://doi.org/10.1145/3715754 Demystifying llm-based software engineering agents . Proc. ACM Softw. Eng., 2(FSE)

work page doi:10.1145/3715754 2025
[24]

Yixuan Even Xu, Yash Savani, Fei Fang, and Zico Kolter. 2025a. Not all rollouts are useful: Down-sampling rollouts in llm reinforcement learning. arXiv preprint arXiv:2504.13818

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. 2025b. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding. arXiv preprint arXiv:2503.02951

work page arXiv 2025
[26]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. 2024. https://openreview.net/forum?id=mXpq6ut8J3 SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

work page 2024
[28]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. https://openreview.net/forum?id=WE_vluYUL-X React: Synergizing reasoning and acting in language models . In The Eleventh International Conference on Learning Representations

work page 2023
[29]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, and 16 others. 2025. https://arxiv.org/abs/2503.14476 Dapo: An open-source llm reinforcement learning system at scale . Preprin...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. 2025. https://arxiv.org/abs/2503.01491 What's behind ppo's collapse in long-cot? value optimization holds the secret . Preprint, arXiv:2503.01491

work page arXiv 2025
[31]

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, and 8 others. 2025. https://arxiv.org/abs/2504.05118 Vapo: Efficient and reliable reinforcement learning for advanced reasonin...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Shaokun Zhang, Yi Dong, Jieyu Zhang, Jan Kautz, Bryan Catanzaro, Andrew Tao, Qingyun Wu, Zhiding Yu, and Guilin Liu. 2025. Nemotron-research-tool-n1: Tool-using language models with reinforced reasoning. arXiv preprint arXiv:2505.00024

work page arXiv 2025
[33]

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. 2025. https://arxiv.org/abs/2507.18071 Group sequence policy optimization . Preprint, arXiv:2507.18071

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Li Zhong, Zilong Wang, and Jingbo Shang. 2024. Debug like a human: A large language model debugger via verifying runtime execution step by step. In Findings of the Association for Computational Linguistics ACL 2024, pages 851--870

work page 2024
[35]

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2024. https://openreview.net/forum?id=njwv9BsGHF Language agent tree search unifies reasoning, acting, and planning in language models . In Forty-first International Conference on Machine Learning

work page 2024
[36]

Richard Zhuang*, Trung Vu*, Alex Dimakis, and Maheswaran Sathiamoorthy. 2025. Improving multi-turn tool use with reinforcement learning. https://www.bespokelabs.ai/blog/improving-multi-turn-tool-use-with-reinforcement-learning. Accessed: 2025-04-17

work page 2025
[37]

Terry Yue Zhuo, Vu Minh Chien, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen GONG, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, and 14 others. 2025. https://openreview.net/forum?id=YrycTjllL0 Bigcodebench: Benchmarking...

work page 2025