SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Pith reviewed 2026-05-16 20:21 UTC · model grok-4.3
The pith
GPT-5.4 with OpenHands succeeds on only 25 percent of long-horizon software evolution tasks, whereas GPT-5.2 reaches 72.80 percent on the single-issue SWE-Bench Verified benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWE-EVO shows that current agents struggle with the sustained multi-file reasoning that long-horizon software evolution requires. The benchmark contains 48 tasks constructed from release notes of seven mature open-source Python projects; each task demands coordinated modifications spanning an average of 21 files and must pass a test suite averaging 874 tests per instance. GPT-5.4 paired with OpenHands reaches only 25 percent success on SWE-EVO, while GPT-5.2 reaches 72.80 percent on SWE-Bench Verified. The paper also introduces Fix Rate to capture incremental progress on these harder tasks.
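To give the 25 percent figure some scale: with 48 tasks it corresponds to 12 solved tasks, assuming the reported rate is exact (per-task counts are not given here). The Python sketch below computes a Wilson score interval for that proportion. This is a standard binomial interval, not an analysis the paper reports, and the function name and the 1.96 critical value are illustrative choices.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Assumption: 25 percent of 48 tasks means exactly 12 resolved tasks.
low, high = wilson_interval(12, 48)
print(f"SWE-EVO success rate interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 15 to 39 percent, so its upper end still sits well below 72.80 percent. The interval only reflects sample size; it says nothing about whether the two benchmark results are otherwise comparable.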
What carries the argument
The SWE-EVO benchmark: 48 tasks extracted from real release notes, each requiring multi-step modifications across an average of 21 files and each validated against a large test suite.
If this is right
- Agents require improved mechanisms for tracking dependencies and coordinating changes across many files over multiple iterations.
- Training on isolated single-issue tasks does not transfer effectively to iterative codebase evolution.
- Fix Rate offers a practical way to measure and reward partial progress on long-horizon tasks.
- Future benchmarks should prioritize long-horizon scenarios to better match real development workflows.
- The observed gap indicates that current agent designs miss key capabilities for preserving functionality while evolving codebases.
Where Pith is reading between the lines
- Agent designs may need explicit planning layers that maintain a global view of file interactions across iterations.
- The benchmark could be extended to non-Python projects to test whether the performance drop is language-specific.
- Hybrid systems that combine agent reasoning with automated dependency analysis might close part of the observed gap.
- Developers could use SWE-EVO-style tasks to evaluate whether an agent is ready for incremental production changes.
Load-bearing premise
Tasks extracted from release notes of seven mature projects accurately represent the distribution and difficulty of real-world long-horizon software evolution.
What would settle it
An agent architecture that reaches success rates above 70 percent on the full SWE-EVO suite without changes to its core design would falsify the claim of a fundamental limitation in sustained multi-file reasoning.
Original abstract
Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric capturing partial progress on these complex, long-horizon tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SWE-EVO, a benchmark of 48 long-horizon software evolution tasks derived from release notes of seven mature Python projects. These tasks involve multi-step modifications across an average of 21 files and are validated using test suites averaging 874 tests. Experiments show GPT-5.4 with OpenHands achieving 25% success on SWE-EVO, in contrast to 72.80% by GPT-5.2 on SWE-Bench Verified, which the authors take as evidence that agents struggle with sustained multi-file reasoning. The manuscript also proposes the Fix Rate metric for measuring partial progress.
Significance. Should the benchmark construction prove robust and the observed performance gap be attributable to task horizon rather than model or framework differences, this work would provide valuable evidence of limitations in current coding agents for realistic software evolution scenarios. The introduction of a partial progress metric like Fix Rate could aid in evaluating incremental improvements in agent capabilities.
major comments (2)
- [Experiments] The central performance comparison pits GPT-5.4 + OpenHands on SWE-EVO against GPT-5.2 on SWE-Bench Verified. This setup does not control for model version differences, leaving open whether the 25% vs 72.80% gap stems from the long-horizon nature of the tasks or from inherent differences between the models.
- [Benchmark Construction] The paper lacks specific details on how release notes were filtered to create tasks, how task validity was ensured, and how partial success was quantified for the Fix Rate metric. These omissions undermine the ability to fully assess the benchmark's soundness and reproducibility.
minor comments (1)
- [Abstract] Clarify the exact versions or naming of 'GPT-5.4' and 'GPT-5.2' to avoid confusion with actual model releases.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and rigor.
Point-by-point responses
- Referee: [Experiments] The central performance comparison pits GPT-5.4 + OpenHands on SWE-EVO against GPT-5.2 on SWE-Bench Verified. This setup does not control for model version differences, leaving open whether the 25% vs 72.80% gap stems from the long-horizon nature of the tasks or from inherent differences between the models.
  Authors: We acknowledge that the model versions are not identical. However, GPT-5.4 is a newer release than GPT-5.2; if anything, this should bias results in favor of stronger performance on SWE-EVO. The large observed gap therefore provides evidence that long-horizon, multi-file evolution tasks remain challenging even for improved models. To strengthen the comparison, we will add results for GPT-5.4 on SWE-Bench Verified in the revised manuscript. Revision: yes
- Referee: [Benchmark Construction] The paper lacks specific details on how release notes were filtered to create tasks, how task validity was ensured, and how partial success was quantified for the Fix Rate metric. These omissions undermine the ability to fully assess the benchmark's soundness and reproducibility.
  Authors: We agree that these details are essential. The revised manuscript will expand the benchmark construction section with explicit criteria for filtering release notes, the process used to ensure task validity (including test-suite executability and coverage checks), and a precise definition of Fix Rate together with its calculation and examples of partial-progress scoring. Revision: yes
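The paper's definition of Fix Rate is not reproduced here, so the Python sketch below is only one plausible reading of a partial-progress metric, not the authors' formula. It assumes each task has a set of tests that the target release should newly pass and a set that must keep passing. These are named `fail_to_pass` and `pass_to_pass`, following the SWE-bench convention. The score is the fraction of the first set that passes, with zero credit if any test in the second set regresses. All names and the zero-on-regression rule are hypothetical.

```python
def fix_rate(fail_to_pass: set[str], pass_to_pass: set[str], passed_after: set[str]) -> float:
    """Hypothetical partial-progress score for one task, in [0.0, 1.0].

    fail_to_pass: tests expected to go from failing to passing (assumed input).
    pass_to_pass: tests that passed before and must still pass (assumed input).
    passed_after: tests that pass once the agent's patch is applied.
    """
    if not fail_to_pass:
        raise ValueError("a task needs at least one target test")
    # Assumed rule: breaking existing behaviour forfeits all partial credit.
    if not pass_to_pass <= passed_after:
        return 0.0
    fixed = fail_to_pass & passed_after
    return len(fixed) / len(fail_to_pass)


# Illustrative data only: two of four target tests fixed, no regressions.
score = fix_rate({"t1", "t2", "t3", "t4"}, {"keep1"}, {"t1", "t2", "keep1"})
print(score)
```

A task counts as fully resolved under this sketch only when the score is 1.0; averaging the per-task scores would give a benchmark-level figure, again assuming this reading of the metric.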
Circularity Check
No circularity: benchmark built from external sources with independent validation
Full rationale
The paper constructs SWE-EVO tasks directly from release notes and test suites of seven external open-source Python projects, then reports agent performance against the separately published SWE-Bench Verified benchmark. No equations, fitted parameters, self-definitions, or load-bearing self-citations appear in the derivation of the benchmark or the reported gap; all inputs are externally sourced and the central claim rests on empirical measurement rather than internal reduction.
Forward citations
Cited by 10 Pith papers
- VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
  VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
- SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
  SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.
- PBT-Bench: Benchmarking AI Agents on Property-Based Testing
  PBT-Bench is a new benchmark of 100 property-based testing problems with 365 injected semantic bugs across 40 Python libraries that measures LLMs on deriving invariants and precise input-generation strategies.
- PBT-Bench: Benchmarking AI Agents on Property-Based Testing
  PBT-Bench evaluates eight LLMs on 100 property-based testing tasks requiring derivation of invariants from docs and construction of targeted input generators to reveal 365 injected semantic bugs.
- Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
  TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
- ProgramBench: Can Language Models Rebuild Programs From Scratch?
  ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
- Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
  A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
- The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
  Triadic data (synchronized human-human conversations, human-AI sessions, and cross-functional team work) is the essential substrate for training long-horizon software engineering agents.
- More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems
  AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.
- AI for Auto-Research: Roadmap & User Guide
  The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.