arXiv: 2604.16335 · v1 · submitted 2026-03-13 · cs.LG · cs.AI · cs.SE

Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

Jiawei Huang, Qingping Yang, Renjie Zheng, Jiaze Chen

Pith reviewed 2026-05-15 12:19 UTC · model grok-4.3

keywords: generative reward model, reinforced fine-tuning, SWE agents, rubric-based evaluation, trajectory filtration, LLM agents, software engineering tasks, behavioral patterns

The pith

Rubric-based generative reward models give richer signals than binary test outcomes for fine-tuning software engineering agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard fine-tuning of LLM agents on software engineering tasks depends on binary terminal rewards, such as whether all tests pass. This signal only judges the end result and gives no information about the quality of the intermediate steps in multi-turn interactions. The authors introduce a generative reward model that applies human-designed rubrics to score entire trajectories, identifying specific behavioral patterns to encourage or discourage. These scores are used to filter training data for reinforced fine-tuning. Experiments and case studies indicate that the rubric approach suppresses undesirable patterns more effectively than terminal-score rejection sampling and raises final test accuracy.

Core claim

A rubric-based Generative Reward Model equipped with human-designed criteria evaluates multi-step agent trajectories to filter high-quality data for Reinforced Fine-Tuning. This method outperforms terminal-score-only rejection sampling by more effectively suppressing undesirable behavioral patterns and promoting beneficial ones, as confirmed by case analyses, and improves final test accuracy on SWE tasks.

What carries the argument

Rubric-based Generative Reward Model (GRM) that scores full trajectories against human-designed criteria for specific behavioral patterns to enable targeted trajectory filtration.
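
To make the filtration idea concrete, here is a minimal sketch that contrasts terminal-score-only rejection sampling with a rubric-aware filter. It is an illustration under stated assumptions, not the authors' pipeline: the `Trajectory` fields, the 0-to-1 score scale, the 0.7 threshold, and the choice to require passing tests in addition to a rubric score are all hypothetical, since the abstract does not state the actual filtration criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    tests_passed: bool   # verifiable terminal reward (all tests pass or not)
    rubric_score: float  # hypothetical GRM score in [0, 1]; scale is assumed


def terminal_only_filter(trajs: list[Trajectory]) -> list[Trajectory]:
    """Baseline: keep every trajectory whose final patch passes the tests."""
    return [t for t in trajs if t.tests_passed]


def rubric_filter(trajs: list[Trajectory], min_score: float = 0.7) -> list[Trajectory]:
    """Assumed variant: also require a minimum rubric score from the GRM."""
    return [t for t in trajs if t.tests_passed and t.rubric_score >= min_score]


# Toy data, invented for illustration only.
trajs = [
    Trajectory("task-a", True, 0.9),
    Trajectory("task-a", True, 0.4),   # passes tests but scores poorly on process
    Trajectory("task-b", False, 0.8),  # good process, wrong final answer
]

kept_baseline = terminal_only_filter(trajs)
kept_rubric = rubric_filter(trajs)
```

In this toy setup the second trajectory survives the baseline filter but not the rubric filter, which is the kind of case the paper's claim depends on.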

If this is right

Rubric signals enable finer control over intermediate agent behaviors than binary terminal rewards alone.
Trajectory filtration using the GRM produces higher-quality training data for reinforced fine-tuning.
Suppression of undesirable patterns and promotion of beneficial ones both occur more reliably than with terminal-score sampling.
Final test accuracy on software engineering tasks improves as a direct result of the richer learning signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rubric approach could apply to other long-horizon agent tasks where outcome-only rewards leave process quality unguided.
Hybrid reward models that combine rubric criteria with verifiable terminal signals might further strengthen training.
Reducing reliance on fully human-crafted rubrics through automated rubric induction would make the method more scalable.
Process-level supervision of this form addresses a general limitation in outcome-only reinforcement learning for sequential decision tasks.
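
One way to picture the hybrid reward mentioned above is a weighted blend of the terminal signal and an averaged rubric score. This is only a sketch of that editorial extension: the function name, the linear blend, and the `weight` parameter are assumptions and are not taken from the paper.

```python
def hybrid_reward(
    tests_passed: bool,
    rubric_scores: dict[str, float],
    weight: float = 0.5,
) -> float:
    """Blend a binary terminal reward with a mean rubric score.

    `weight` is the share given to the rubric signal (0 = terminal only,
    1 = rubric only). Rubric scores are assumed to lie in [0, 1].
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    terminal = 1.0 if tests_passed else 0.0
    if rubric_scores:
        rubric = sum(rubric_scores.values()) / len(rubric_scores)
    else:
        rubric = 0.0  # no rubric items scored: fall back to zero rubric credit
    return (1.0 - weight) * terminal + weight * rubric
```

Setting `weight=0.0` recovers the terminal-only reward, so the same function can serve as the baseline in a comparison.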

Load-bearing premise

Human-designed rubrics can correctly identify and penalize undesirable intermediate behaviors in multi-step trajectories without introducing new biases or overlooking important failure modes.

What would settle it

A head-to-head experiment on the same SWE benchmark where rubric-filtered RFT shows no gain in test accuracy or no reduction in undesirable patterns relative to terminal-score rejection sampling would falsify the central claim.
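
A head-to-head comparison of this kind could be summarized with a paired test over per-task outcomes. The sketch below uses an exact McNemar test on the tasks where the two models disagree; that choice of test is an assumption for illustration, since the source does not say how the paper compares the two methods, and the function name and inputs are hypothetical.

```python
from math import comb


def paired_comparison(baseline: list[bool], rubric: list[bool]) -> tuple[float, float]:
    """Compare two models evaluated on the same tasks, in the same order.

    Returns (accuracy difference, two-sided exact McNemar p-value).
    Each list holds True where the model resolved the task.
    """
    if len(baseline) != len(rubric) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    # Discordant pairs: tasks solved by exactly one of the two models.
    only_rubric = sum(r and not b for b, r in zip(baseline, rubric))
    only_base = sum(b and not r for b, r in zip(baseline, rubric))
    n = only_rubric + only_base
    if n == 0:
        p_value = 1.0
    else:
        k = min(only_rubric, only_base)
        # Two-sided binomial tail with p = 0.5 on the discordant pairs.
        p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n)
    acc_diff = (sum(rubric) - sum(baseline)) / len(baseline)
    return acc_diff, p_value
```

An accuracy difference near zero, or a large p-value on a reasonably sized benchmark, would be the outcome that counts against the central claim.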

Original abstract

Despite recent progress in Large Language Model (LLM) Agents for Software Engineering (SWE) tasks, end-to-end fine-tuning typically relies on verifiable terminal rewards such as whether all unit tests pass. While these binary signals reflect whether the final solution is correct, they provide little guidance for shaping intermediate behaviors during multi-step interactions, thereby limiting improvements in the overall quality of the resolution process. To address this, we introduce a rubric-based Generative Reward Model (GRM) that provides richer learning signals. The GRM is equipped with human-designed rubrics that indicate criteria for encouraging or discouraging specific behavioral patterns, and we leverage this feedback for high-quality training data collection via trajectory filtration. When used for Reinforced Fine-Tuning (RFT) on SWE Tasks, our approach outperforms terminal-score-only rejection sampling: it more effectively suppresses undesirable patterns while promoting beneficial ones, as confirmed by case analyses, and it ultimately improves final test accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes using human rubrics inside a generative reward model to filter trajectories for reinforced fine-tuning of SWE agents, which could address the limits of terminal rewards, but the abstract shows no quantitative results to support the outperformance claim.

The main point is a method that adds human-designed rubrics to a generative reward model so it can score and filter multi-step trajectories before reinforced fine-tuning on software engineering tasks. This is meant to give richer signals than just checking whether the final code passes all tests. The abstract frames it as a step past standard verifiable-reward rejection sampling, with case analyses showing better suppression of bad intermediate patterns and higher final accuracy. That framing is straightforward and targets a real issue in agent training where binary end signals leave the model without guidance on how to behave along the way.

The idea of rubric-driven filtration for data collection is the clearest new piece here. It builds on existing RFT setups but tries to make the reward model more interpretable and controllable through explicit criteria for desirable or undesirable behaviors. That could be useful for teams that already have access to human experts who can write rubrics.

The evidence presented is thin. The abstract mentions outperformance and case analyses but gives no accuracy numbers, dataset sizes, baseline details, or statistical comparisons. Without those, it is difficult to tell whether the reported gains come from the rubrics or from other factors in the training setup. The central assumption, that the rubrics reliably catch the right patterns without adding new biases or missing failure modes, also lacks any reported checks such as ablation studies or inter-rater reliability measures. If the rubrics are incomplete, the filtration step could simply reinforce whatever the rubric writers happened to notice.

This work is aimed at researchers and engineers working on LLM agents for coding who are already experimenting with reward shaping and trajectory selection. A reader looking for concrete implementation details or reproducible experiments would find the current version limited, but the underlying problem it identifies is worth attention.

If the full paper includes proper quantitative results and some validation of the rubrics, it would be worth sending out for peer review. Right now the claims rest too heavily on qualitative cases to stand on their own.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a rubric-based Generative Reward Model (GRM) equipped with human-designed rubrics to generate richer learning signals for trajectory filtration in Reinforced Fine-Tuning (RFT) of LLM agents on Software Engineering (SWE) tasks. It claims this outperforms terminal-score-only rejection sampling by more effectively suppressing undesirable intermediate behavioral patterns and promoting beneficial ones (as confirmed by case analyses), ultimately improving final test accuracy.

Significance. If the empirical claims hold after proper validation, the work addresses a genuine limitation in agent fine-tuning by moving beyond binary terminal rewards to shape multi-step behaviors, which could improve reliability and quality in SWE agents and similar sequential decision tasks.

major comments (2)

[Abstract] The central claim of outperformance and improved final test accuracy is asserted without any quantitative results, baselines, error bars, dataset sizes, or experimental details, so the evidence for the claim cannot be evaluated from the provided text.
[Abstract] The method rests on the assumption that human-designed rubrics accurately identify and penalize undesirable intermediate patterns without introducing new biases or missing critical failure modes; however, the text mentions only case analyses for confirmation and provides no inter-rater reliability, rubric ablation, or automated behavioral metric comparisons.
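
For readers unfamiliar with the inter-rater reliability check the referee asks for, Cohen's kappa is one standard choice for two raters. The sketch below computes it for two lists of categorical rubric judgments; the use of kappa specifically, and the function name, are assumptions, since the report does not say which reliability measure it has in mind.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the label marginals.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

The same function could be applied to a human rater versus the GRM, which would speak to whether the model applies the rubric the way its authors intend.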

minor comments (2)

Clarify the exact form of the GRM (e.g., how rubrics are encoded into prompts or loss terms) and the filtration criteria used for high-quality training data collection.
Add a dedicated section or table summarizing the rubrics, with concrete examples of encouraged/discouraged behaviors in SWE trajectories.
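
To illustrate the kind of detail the minor comments ask for, the sketch below shows one plausible way rubric items could be encoded into a grading prompt. Everything in it is hypothetical: the `RubricItem` type, the prompt wording, the 1-to-5 scale, and the example items are assumptions for illustration, not the paper's actual prompt format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    name: str
    description: str
    encourage: bool  # True = behavior to reward, False = behavior to penalize


def build_grm_prompt(issue: str, trajectory_text: str, rubric: list[RubricItem]) -> str:
    """Assemble a plain-text grading prompt from an issue, a trajectory, and rubric items."""
    lines = [
        "You are grading an agent trajectory for a software engineering task.",
        "",
        "Issue:",
        issue,
        "",
        "Trajectory:",
        trajectory_text,
        "",
        "Rubric:",
    ]
    for index, item in enumerate(rubric, start=1):
        kind = "ENCOURAGE" if item.encourage else "DISCOURAGE"
        lines.append(f"{index}. [{kind}] {item.name}: {item.description}")
    lines += ["", "Return one integer score from 1 to 5 per rubric item."]
    return "\n".join(lines)


# Hypothetical rubric items, invented for this example.
example_rubric = [
    RubricItem("reproduce_first", "Reproduces the issue before editing code.", True),
    RubricItem("repeated_actions", "Repeats the same failing command without changes.", False),
]
prompt = build_grm_prompt("Example issue text", "step 1 ... step n", example_rubric)
```

Writing the rubric as data rather than free text would also make the requested summary table straightforward to generate.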

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, with revisions planned for the next version of the manuscript.

Referee: [Abstract] The central claim of outperformance and improved final test accuracy is asserted without any quantitative results, baselines, error bars, dataset sizes, or experimental details, so the evidence for the claim cannot be evaluated from the provided text.

Authors: We agree that the abstract should include key quantitative results to support immediate evaluation of the claims. In the revised manuscript, we will update the abstract to report specific improvements (e.g., accuracy gains over the terminal-score baseline on the primary SWE benchmark), dataset sizes, and main baselines. Full experimental details, including error bars across runs, remain in the Experiments section.
revision: yes

Referee: [Abstract] The method rests on the assumption that human-designed rubrics accurately identify and penalize undesirable intermediate patterns without introducing new biases or missing critical failure modes; however, the text mentions only case analyses for confirmation and provides no inter-rater reliability, rubric ablation, or automated behavioral metric comparisons.

Authors: This is a valid concern. The original manuscript used case analyses to illustrate behavioral shaping. To strengthen the evidence, the revision will add inter-rater reliability metrics for the rubrics, rubric ablation results quantifying each component's contribution, and automated behavioral metrics for pattern suppression. These will appear in the main Experiments section and appendix.
revision: yes
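
The error bars across runs mentioned in the rebuttal are typically reported as a mean with a standard error. A minimal sketch, assuming each run yields one accuracy value and that the standard error of the mean is the intended statistic (the rebuttal does not specify):

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(accuracies: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) over independent runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate spread")
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return mean(accuracies), stdev(accuracies) / sqrt(len(accuracies))


# Placeholder values, not results from the paper.
run_mean, run_se = summarize_runs([0.1, 0.2, 0.3])
```

With only a handful of seeds the standard error is itself noisy, so the number of runs should be reported alongside it.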

Circularity Check

0 steps flagged

No circularity; method relies on external human rubrics and case analyses

The paper presents a rubric-based GRM for trajectory filtration in RFT of SWE agents. No equations, derivations, or fitted parameters appear in the provided abstract or description. The approach uses independently designed human rubrics to filter trajectories and compares against terminal-score rejection sampling, with confirmation via case analyses. This does not reduce any claimed prediction to its inputs by construction, nor does it rely on self-citation chains or imported uniqueness theorems. The central claim remains empirically grounded in external criteria rather than self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that human rubrics can be written to reliably distinguish good and bad intermediate behaviors and that filtering trajectories with the GRM produces higher-quality training data than terminal rewards alone.

axioms (1)

domain assumption: Human-designed rubrics can accurately evaluate and guide specific behavioral patterns in multi-step agent trajectories.
The method uses these rubrics for both reward modeling and trajectory filtration; the abstract treats their quality as given.

invented entities (1)

Rubric-based Generative Reward Model (GRM) · no independent evidence
purpose: To generate richer learning signals than binary terminal rewards for reinforced fine-tuning.
New model introduced to address the limitation of verifiable rewards; no independent evidence of its correctness is provided in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J(x) uniqueness) · relation: unclear
Paper passage: "we introduce a rubric-based Generative Reward Model (GRM) that provides richer learning signals. The GRM is equipped with human-designed rubrics that indicate criteria for encouraging or discouraging specific behavioral patterns"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection (coupling combiner forces bilinear branch) · relation: unclear
Paper passage: "rubrics-based GRM filtering for collecting high-quality trajectories for Reinforced Fine-Tuning (RFT)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Revisiting DAgger in the Era of LLM-Agents
cs.LG · 2026-05 · conditional · novelty 6.0

DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

Appendix A Details in GRM Prompt Engineering

A.1 GRM Prompt Structures

For turn-level GRM, we compare all the actions toget...

Review the Conversation History: Read through the sequence of actions taken by the LLM and the results it received. Get a sense of the model’s approach so far.

Step 2: Score the Candidate Actions by the Rubrics. Carefully review the user’s question and the conversation context to understand both the task and the current resolution stage. Then, evaluate eac...

Analyze the User Instruction: Read the GitHub issue carefully. Identify the reported bug or requested feature and the exact solution requirements.

Study the Ground-Truth Patch: Examine the validated solution. Note which files changed and the nature of the change (for example, logic correction, wrong variable, missing condition). Use this to judge whether a segment is moving toward the real solution.

YES" if you believe the first trajectory is better, and

Review the Trajectories: Read through the sequence of actions taken by the two trajectories and the results they received. Get a sense of the model’s approach so far.

Step 2: Score the Candidate Actions by the Rubrics. Carefully review the user’s question and the conversation context to understand both the task and the current resolution stage. Then, evalu...

Conclude with a summary of the changes and finalize the resolution.

How well does the action align with the ideal workflow above, addresses uncompleted critical steps, and focuses on the core of the user’s question and the task objectives? Higher scores for actions that advance new and essential steps in the workflow that have not yet been completed, and ...

Running existing tests in the repository
Identify and inspect the files relevant to the problem and its solution
Create and execute a reproduction script (e.g., 'reproduce_error.py') to recreate the error or problematic state, if feasible
Edit relevant files to fix the bug or implement the required change
Re-run the repository’s existing test cases and the reproduction script to confirm the issue is solved
Develop and execute a more comprehensive test script (e.g., 'comprehensive_tests.py') to check for edge cases
Conclude with a summary of the changes and finalize the resolution.

How well does the trajectory align with the ideal workflow above, addresses uncompleted critical steps, and focuses on the core of the user’s question and the task objectives?
• A good trajectory should demonstrate a masterful adherence to the workflow. It prioritizes running tests to est...
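
The workflow fragments above suggest a rubric that rewards actions for advancing steps that have not yet been completed. A minimal sketch of that bookkeeping follows. The stage identifiers are shorthand invented here for the seven listed steps, and the binary "advances or not" check is an assumption; the truncated prompt text does not show how the GRM actually turns this into a score.

```python
# Shorthand identifiers (hypothetical) for the workflow steps listed above, in order.
WORKFLOW = [
    "run_existing_tests",
    "inspect_relevant_files",
    "reproduce_issue",
    "edit_files",
    "rerun_tests_and_reproduction",
    "comprehensive_tests",
    "summarize_and_finalize",
]


def uncompleted_steps(completed: list[str]) -> list[str]:
    """Return workflow steps not yet completed, preserving workflow order."""
    done = set(completed)
    unknown = done - set(WORKFLOW)
    if unknown:
        raise ValueError(f"unknown workflow steps: {sorted(unknown)}")
    return [step for step in WORKFLOW if step not in done]


def advances_workflow(action_step: str, completed: list[str]) -> bool:
    """True if the candidate action targets a step that is still uncompleted."""
    return action_step in uncompleted_steps(completed)


remaining = uncompleted_steps(["run_existing_tests", "inspect_relevant_files"])
```

A real grader would presumably weigh how essential each remaining step is, which this binary check deliberately leaves out.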