pith. machine review for the scientific record.

arxiv: 2604.16335 · v1 · submitted 2026-03-13 · 💻 cs.LG · cs.AI · cs.SE

Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

Pith reviewed 2026-05-15 12:19 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.SE
keywords generative reward model · reinforced fine-tuning · SWE agents · rubric-based evaluation · trajectory filtration · LLM agents · software engineering tasks · behavioral patterns

The pith

Rubric-based generative reward models give richer signals than binary test outcomes for fine-tuning software engineering agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard fine-tuning of LLM agents on software engineering tasks depends on binary terminal rewards, such as whether all tests pass. This signal only judges the end result and gives no information about the quality of the intermediate steps in multi-turn interactions. The authors introduce a generative reward model that applies human-designed rubrics to score entire trajectories, identifying specific behavioral patterns to encourage or discourage. These scores are used to filter training data for reinforced fine-tuning. Experiments and case studies indicate that the rubric approach suppresses undesirable patterns more effectively than terminal-score rejection sampling and raises final test accuracy.

Core claim

A rubric-based Generative Reward Model equipped with human-designed criteria evaluates multi-step agent trajectories to filter high-quality data for Reinforced Fine-Tuning. This method outperforms terminal-score-only rejection sampling by more effectively suppressing undesirable behavioral patterns and promoting beneficial ones, as confirmed by case analyses, and improves final test accuracy on SWE tasks.

What carries the argument

Rubric-based Generative Reward Model (GRM) that scores full trajectories against human-designed criteria for specific behavioral patterns to enable targeted trajectory filtration.
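
As a rough illustration of the filtration idea, the sketch below contrasts terminal-score-only rejection sampling with a rubric-gated filter. Everything in it is hypothetical: the `Trajectory` fields, the rubric names, the 0-to-1 score scale, and the `min_score` threshold are assumptions made for illustration, and the stored rubric scores stand in for whatever a GRM would output. The abstract does not state the paper's actual filtering rule.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task_id: str
    tests_pass: bool  # verifiable terminal reward
    # Hypothetical GRM outputs, one score in [0, 1] per rubric.
    rubric_scores: dict = field(default_factory=dict)


def terminal_filter(trajs):
    """Terminal-score-only rejection sampling: keep every passing trajectory."""
    return [t for t in trajs if t.tests_pass]


def rubric_filter(trajs, min_score=0.5):
    """Assumed rule: keep passing trajectories whose weakest rubric score
    still clears min_score."""
    return [
        t for t in trajs
        if t.tests_pass
        and t.rubric_scores
        and min(t.rubric_scores.values()) >= min_score
    ]


trajs = [
    Trajectory("a", True, {"reproduces_bug": 0.9, "runs_tests": 0.8}),
    Trajectory("b", True, {"reproduces_bug": 0.2, "runs_tests": 0.9}),
    Trajectory("c", False, {"reproduces_bug": 0.9, "runs_tests": 0.9}),
]
kept_terminal = [t.task_id for t in terminal_filter(trajs)]
kept_rubric = [t.task_id for t in rubric_filter(trajs)]
print(kept_terminal)  # ['a', 'b']
print(kept_rubric)    # ['a']
```

Trajectory "b" passes its tests but scores poorly on one rubric, so only the rubric-gated filter drops it. That is the kind of case a terminal-only signal cannot separate.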

If this is right

  • Rubric signals enable finer control over intermediate agent behaviors than binary terminal rewards alone.
  • Trajectory filtration using the GRM produces higher-quality training data for reinforced fine-tuning.
  • Suppression of undesirable patterns and promotion of beneficial ones both occur more reliably than with terminal-score sampling.
  • Final test accuracy on software engineering tasks improves as a direct result of the richer learning signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rubric approach could apply to other long-horizon agent tasks where outcome-only rewards leave process quality unguided.
  • Hybrid reward models that combine rubric criteria with verifiable terminal signals might further strengthen training.
  • Reducing reliance on fully human-crafted rubrics through automated rubric induction would make the method more scalable.
  • Process-level supervision of this form addresses a general limitation in outcome-only reinforcement learning for sequential decision tasks.
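
One way to picture the hybrid reward mentioned above is a weighted blend of the terminal signal and the mean rubric score. This is an editorial sketch, not something the paper proposes: the function name `hybrid_reward`, the linear blend, and the default `weight` are all assumptions.

```python
from statistics import fmean


def hybrid_reward(tests_pass, rubric_scores, weight=0.5):
    """Blend a binary terminal reward with the mean rubric score.

    weight is the share given to the terminal signal (assumed, not from the paper).
    rubric_scores is a list of hypothetical GRM scores in [0, 1].
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    terminal = 1.0 if tests_pass else 0.0
    process = fmean(rubric_scores) if rubric_scores else 0.0
    return weight * terminal + (1.0 - weight) * process


print(hybrid_reward(True, [0.5, 1.0]))   # 0.875
print(hybrid_reward(False, [0.5, 1.0]))  # 0.375
```

With `weight=1.0` the function reduces to the binary terminal reward, so the blend can be tuned between outcome-only and process-only supervision.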

Load-bearing premise

Human-designed rubrics can correctly identify and penalize undesirable intermediate behaviors in multi-step trajectories without introducing new biases or overlooking important failure modes.

What would settle it

A head-to-head experiment on the same SWE benchmark where rubric-filtered RFT shows no gain in test accuracy or no reduction in undesirable patterns relative to terminal-score rejection sampling would falsify the central claim.
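
A minimal sketch of how the accuracy half of such a head-to-head comparison could be checked, assuming both models are evaluated on the same task list and each task yields a resolved/unresolved outcome. The exact sign test on discordant tasks is a choice made here for illustration, not the paper's protocol, and the outcome lists are made-up toy data.

```python
from math import comb


def paired_sign_test(rubric_resolved, baseline_resolved):
    """Exact two-sided sign test over tasks where the two runs disagree.

    Returns (tasks only the rubric run resolved,
             tasks only the baseline resolved,
             p-value under the null that either outcome is equally likely).
    """
    if len(rubric_resolved) != len(baseline_resolved):
        raise ValueError("both runs must cover the same tasks")
    pairs = list(zip(rubric_resolved, baseline_resolved))
    rubric_only = sum(1 for r, b in pairs if r and not b)
    baseline_only = sum(1 for r, b in pairs if b and not r)
    n = rubric_only + baseline_only
    if n == 0:
        return rubric_only, baseline_only, 1.0
    tail = sum(comb(n, k) for k in range(min(rubric_only, baseline_only) + 1)) / 2 ** n
    return rubric_only, baseline_only, min(1.0, 2 * tail)


# Toy outcomes on eight shared tasks (illustrative only).
rubric = [True, True, True, False, True, True, False, True]
baseline = [True, False, False, False, True, False, False, True]
result = paired_sign_test(rubric, baseline)
print(result)  # (3, 0, 0.25)
```

In the toy data the rubric run resolves three extra tasks and loses none, yet the p-value is 0.25, which shows why small benchmarks cannot settle the claim on their own. The behavioral-pattern half of the test would need a separate per-trajectory metric.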

Original abstract

Despite recent progress in Large Language Model (LLM) Agents for Software Engineering (SWE) tasks, end-to-end fine-tuning typically relies on verifiable terminal rewards such as whether all unit tests pass. While these binary signals reflect whether the final solution is correct, they provide little guidance for shaping intermediate behaviors during multi-step interactions, thereby limiting improvements in the overall quality of the resolution process. To address this, we introduce a rubric-based Generative Reward Model (GRM) that provides richer learning signals. The GRM is equipped with human-designed rubrics that indicate criteria for encouraging or discouraging specific behavioral patterns, and we leverage this feedback for high-quality training data collection via trajectory filtration. When used for Reinforced Fine-Tuning (RFT) on SWE Tasks, our approach outperforms terminal-score-only rejection sampling: it more effectively suppresses undesirable patterns while promoting beneficial ones, as confirmed by case analyses, and it ultimately improves final test accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a rubric-based Generative Reward Model (GRM) equipped with human-designed rubrics to generate richer learning signals for trajectory filtration in Reinforced Fine-Tuning (RFT) of LLM agents on Software Engineering (SWE) tasks. It claims this outperforms terminal-score-only rejection sampling by more effectively suppressing undesirable intermediate behavioral patterns and promoting beneficial ones (as confirmed by case analyses), ultimately improving final test accuracy.

Significance. If the empirical claims hold after proper validation, the work addresses a genuine limitation in agent fine-tuning by moving beyond binary terminal rewards to shape multi-step behaviors, which could improve reliability and quality in SWE agents and similar sequential decision tasks.

major comments (2)
  1. [Abstract] The central claim of outperformance and improved final test accuracy is asserted without any quantitative results, baselines, error bars, dataset sizes, or experimental details, so the evidence for the claim cannot be evaluated from the provided text.
  2. [Abstract] The method rests on the assumption that human-designed rubrics accurately identify and penalize undesirable intermediate patterns without introducing new biases or missing critical failure modes; however, the text mentions only case analyses for confirmation and provides no inter-rater reliability, rubric ablation, or automated behavioral metric comparisons.
minor comments (2)
  1. Clarify the exact form of the GRM (e.g., how rubrics are encoded into prompts or loss terms) and the filtration criteria used for high-quality training data collection.
  2. Add a dedicated section or table summarizing the rubrics, with concrete examples of encouraged/discouraged behaviors in SWE trajectories.
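
To make the first minor comment concrete, here is one possible way rubrics could be encoded into a GRM prompt and a score read back from the generated text. The prompt wording, the example rubric items, the `SCORE:` output convention, and the 1-to-10 scale are all assumptions made for illustration. The text reviewed here does not specify the paper's prompt format.

```python
import re

# Hypothetical rubric items; the paper's actual rubrics are not reproduced here.
EXAMPLE_RUBRICS = [
    "Reproduces the reported issue before editing code.",
    "Re-runs existing tests after the fix.",
]


def build_grm_prompt(issue, trajectory_text, rubrics):
    """Assemble a plain-text grading prompt that lists each rubric by number."""
    lines = [
        "You are grading an agent trajectory on a software engineering task.",
        "",
        "Issue:",
        issue,
        "",
        "Trajectory:",
        trajectory_text,
        "",
        "Rubrics:",
    ]
    lines += [f"{i}. {rubric}" for i, rubric in enumerate(rubrics, start=1)]
    lines += ["", "Explain your reasoning, then end with 'SCORE: <integer 1-10>'."]
    return "\n".join(lines)


def parse_score(generation, lowest=1, highest=10):
    """Return the last in-range 'SCORE: n' value in the text, else None."""
    matches = re.findall(r"SCORE:\s*(\d+)", generation)
    if not matches:
        return None
    score = int(matches[-1])
    return score if lowest <= score <= highest else None


prompt = build_grm_prompt("Crash on empty input", "step 1: ran pytest", EXAMPLE_RUBRICS)
print(parse_score("The agent skipped reproduction. SCORE: 4"))  # 4
```

Returning `None` for missing or out-of-range scores lets a filtering step discard unparseable judgments instead of treating them as low scores.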

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, with revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim of outperformance and improved final test accuracy is asserted without any quantitative results, baselines, error bars, dataset sizes, or experimental details, so the evidence for the claim cannot be evaluated from the provided text.

    Authors: We agree that the abstract should include key quantitative results to support immediate evaluation of the claims. In the revised manuscript, we will update the abstract to report specific improvements (e.g., accuracy gains over the terminal-score baseline on the primary SWE benchmark), dataset sizes, and main baselines. Full experimental details, including error bars across runs, remain in the Experiments section. revision: yes

  2. Referee: [Abstract] The method rests on the assumption that human-designed rubrics accurately identify and penalize undesirable intermediate patterns without introducing new biases or missing critical failure modes; however, the text mentions only case analyses for confirmation and provides no inter-rater reliability, rubric ablation, or automated behavioral metric comparisons.

    Authors: This is a valid concern. The original manuscript used case analyses to illustrate behavioral shaping. To strengthen the evidence, the revision will add inter-rater reliability metrics for the rubrics, rubric ablation results quantifying each component's contribution, and automated behavioral metrics for pattern suppression. These will appear in the main Experiments section and appendix. revision: yes
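
The inter-rater reliability mentioned in this response could be reported with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below assumes two raters (for example, two annotators, or one annotator and the GRM) who each label the same trajectories against one rubric. The label lists are toy data, and the choice of kappa is an assumption, not something the authors commit to.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# 1 = "trajectory violates the rubric", 0 = "does not" (toy labels).
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 3))  # 0.6
```

Here the raters agree on 8 of 10 items, but because chance agreement is 0.5, kappa comes out near 0.6 rather than 0.8.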

Circularity Check

0 steps flagged

No circularity; method relies on external human rubrics and case analyses

Full rationale

The paper presents a rubric-based GRM for trajectory filtration in RFT of SWE agents. No equations, derivations, or fitted parameters appear in the provided abstract or description. The approach uses independently designed human rubrics to filter trajectories and compares against terminal-score rejection sampling, with confirmation via case analyses. This does not reduce any claimed prediction to its inputs by construction, nor does it rely on self-citation chains or imported uniqueness theorems. The central claim remains empirically grounded in external criteria rather than self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that human rubrics can be written to reliably distinguish good and bad intermediate behaviors and that filtering trajectories with the GRM produces higher-quality training data than terminal rewards alone.

axioms (1)
  • domain assumption: Human-designed rubrics can accurately evaluate and guide specific behavioral patterns in multi-step agent trajectories.
    The method uses these rubrics for both reward modeling and trajectory filtration; the abstract treats their quality as given.
invented entities (1)
  • Rubric-based Generative Reward Model (GRM) (no independent evidence)
    purpose: To generate richer learning signals than binary terminal rewards for reinforced fine-tuning
    New model introduced to address the limitation of verifiable rewards; no independent evidence of its correctness is provided in the abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Revisiting DAgger in the Era of LLM-Agents

    cs.LG 2026-05 conditional novelty 6.0

    DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
