
arxiv: 2512.18470 · v5 · submitted 2025-12-20 · 💻 cs.SE · cs.AI · cs.MA

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Pith reviewed 2026-05-16 20:21 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.MA
keywords coding agents · software evolution · benchmark · long-horizon tasks · multi-file changes · SWE-EVO · performance evaluation · Fix Rate

The pith

Current AI coding agents achieve only 25 percent success on long-horizon software evolution tasks compared to 73 percent on single-issue benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWE-EVO, a benchmark that tests AI coding agents on realistic tasks drawn from actual software release notes. These tasks require interpreting high-level requirements and coordinating changes across an average of 21 files over multiple iterations while preserving overall functionality. Experiments show a sharp performance drop: GPT-5.4 with OpenHands solves only 25 percent of SWE-EVO instances, versus the 72.80 percent that GPT-5.2 achieves on the simpler SWE-Bench Verified. The authors also define Fix Rate, a metric that measures partial progress toward completing these complex tasks. The work highlights that existing agents lack the sustained, multi-file reasoning needed for genuine software evolution.

Core claim

SWE-EVO shows that current agents struggle with the sustained multi-file reasoning required for long-horizon software evolution. The benchmark contains 48 tasks constructed from release notes of seven mature open-source Python projects; each task demands coordinated modifications spanning an average of 21 files and must pass test suites averaging 874 tests per instance. GPT-5.4 paired with OpenHands reaches only 25 percent success on SWE-EVO, while GPT-5.2 reaches 72.80 percent on SWE-Bench Verified. The paper also introduces Fix Rate to capture incremental progress on these harder tasks.
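To put the headline numbers in perspective, the sketch below works through the arithmetic. It assumes that the reported 25 percent corresponds to exactly 12 of the 48 tasks, and it uses a Wilson score interval as one common way to express sampling uncertainty on a small benchmark. Neither the interval nor the function name `wilson_interval` comes from the paper; both are illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Figures reported on this page.
n_tasks = 48
swe_evo_rate = 0.25          # GPT-5.4 + OpenHands on SWE-EVO
swe_bench_rate = 0.7280      # GPT-5.2 on SWE-Bench Verified

# Assumption: 25 percent of 48 tasks means 12 solved tasks.
solved = round(swe_evo_rate * n_tasks)

low, high = wilson_interval(solved, n_tasks)
gap_points = (swe_bench_rate - swe_evo_rate) * 100

print(f"solved tasks: {solved} of {n_tasks}")
print(f"approx. 95% interval on SWE-EVO success: {low:.1%} to {high:.1%}")
print(f"gap to SWE-Bench Verified: {gap_points:.1f} percentage points")
```

Under these assumptions the interval spans roughly 15 to 39 percent, so the upper end still sits well below 72.80 percent. The interval says nothing about the model-version confound raised in the referee report; it only addresses the small task count.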

What carries the argument

The SWE-EVO benchmark: 48 tasks extracted from real release notes, each requiring multi-step modifications across an average of 21 files and validated against large test suites.

If this is right

  • Agents require improved mechanisms for tracking dependencies and coordinating changes across many files over multiple iterations.
  • Training on isolated single-issue tasks does not transfer effectively to iterative codebase evolution.
  • Fix Rate offers a practical way to measure and reward partial progress on long-horizon tasks.
  • Future benchmarks should prioritize long-horizon scenarios to better match real development workflows.
  • The observed gap indicates that current agent designs miss key capabilities for preserving functionality while evolving codebases.
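This page does not give the paper's definition of Fix Rate, so the following is only a hypothetical sketch of what a partial-progress metric of this kind could look like. It assumes each task has a set of target tests that must go from failing to passing, and it gives zero credit when a patch breaks previously passing tests. The class `TaskResult`, its fields, and the scoring rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    target_total: int   # tests that should flip from failing to passing
    target_fixed: int   # how many of those the agent's patch actually fixed
    regressions: int    # previously passing tests that the patch broke


def resolved(r: TaskResult) -> bool:
    """Binary success: every target test fixed and nothing broken."""
    return r.target_fixed == r.target_total and r.regressions == 0


def fix_rate(r: TaskResult) -> float:
    """Hypothetical partial-credit score in [0, 1].

    Assumption: a patch that causes regressions earns no credit,
    since the tasks require preserving existing functionality.
    """
    if r.regressions > 0 or r.target_total == 0:
        return 0.0
    return r.target_fixed / r.target_total


def summarize(results: list[TaskResult]) -> dict[str, float]:
    n = len(results)
    return {
        "resolved_rate": sum(resolved(r) for r in results) / n,
        "mean_fix_rate": sum(fix_rate(r) for r in results) / n,
    }


# Made-up results for four tasks, purely to show the difference
# between the binary metric and the partial-credit one.
demo = [
    TaskResult(target_total=10, target_fixed=10, regressions=0),
    TaskResult(target_total=10, target_fixed=5, regressions=0),
    TaskResult(target_total=10, target_fixed=9, regressions=2),
    TaskResult(target_total=4, target_fixed=0, regressions=0),
]
summary = summarize(demo)
print(summary)
```

In this toy data the second task counts as a failure under the binary metric but still contributes half credit to the mean, which is the kind of incremental signal a partial-progress metric is meant to expose.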

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent designs may need explicit planning layers that maintain a global view of file interactions across iterations.
  • The benchmark could be extended to non-Python projects to test whether the performance drop is language-specific.
  • Hybrid systems that combine agent reasoning with automated dependency analysis might close part of the observed gap.
  • Developers could use SWE-EVO-style tasks to evaluate whether an agent is ready for incremental production changes.
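As a rough illustration of the "automated dependency analysis" idea in the list above, the sketch below builds a reverse import graph with Python's standard `ast` module and computes which modules might need re-checking after one module changes. It is not taken from the paper or from any agent framework. The toy project, the flat module names, and the helper names `imported_modules`, `reverse_dependencies`, and `impacted` are assumptions for illustration.

```python
import ast
from collections import defaultdict


def imported_modules(source: str) -> set[str]:
    """Top-level module names imported (absolutely) by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def reverse_dependencies(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each in-project module to the in-project modules that import it."""
    project = set(files)
    dependents = defaultdict(set)
    for name, source in files.items():
        for dep in imported_modules(source) & project:
            if dep != name:
                dependents[dep].add(name)
    return dict(dependents)


def impacted(changed: str, dependents: dict[str, set[str]]) -> set[str]:
    """Modules that transitively import `changed` and may need re-checking."""
    seen, stack = set(), [changed]
    while stack:
        current = stack.pop()
        for module in dependents.get(current, ()):
            if module not in seen:
                seen.add(module)
                stack.append(module)
    return seen


# Toy in-memory project: module name -> source text (hypothetical).
files = {
    "core": "import os\n",
    "api": "from core import run\n",
    "cli": "import api\nimport json\n",
}
graph = reverse_dependencies(files)
print(impacted("core", graph))
```

A real project would need package-aware module resolution and handling of relative and dynamic imports; the point here is only that a global view of file interactions can be computed outside the agent and handed to it as context.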

Load-bearing premise

Tasks extracted from release notes of seven mature projects accurately represent the distribution and difficulty of real-world long-horizon software evolution.

What would settle it

An agent architecture that reaches success rates above 70 percent on the full SWE-EVO suite without changes to its core design would falsify the claim of a fundamental limitation in sustained multi-file reasoning.

Original abstract

Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric capturing partial progress on these complex, long-horizon tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SWE-EVO, a benchmark of 48 long-horizon software evolution tasks derived from release notes of seven mature Python projects. These tasks involve multi-step modifications across an average of 21 files and are validated using test suites averaging 874 tests. Experiments show GPT-5.4 with OpenHands achieving 25% success on SWE-EVO, in contrast to 72.80% by GPT-5.2 on SWE-Bench Verified, which highlights challenges in sustained multi-file reasoning. The manuscript also proposes the Fix Rate metric for measuring partial progress.

Significance. Should the benchmark construction prove robust and the observed performance gap be attributable to task horizon rather than model or framework differences, this work would provide valuable evidence of limitations in current coding agents for realistic software evolution scenarios. The introduction of a partial progress metric like Fix Rate could aid in evaluating incremental improvements in agent capabilities.

major comments (2)
  1. [Experiments] The central performance comparison pits GPT-5.4 + OpenHands on SWE-EVO against GPT-5.2 on SWE-Bench Verified. This setup does not control for model version differences, leaving open whether the 25% vs 72.80% gap stems from the long-horizon nature of the tasks or from inherent differences between the models.
  2. [Benchmark Construction] The paper lacks specific details on how release notes were filtered to create tasks, how task validity was ensured, and how partial success was quantified for the Fix Rate metric. These omissions undermine the ability to fully assess the benchmark's soundness and reproducibility.
minor comments (1)
  1. [Abstract] Clarify the exact versions or naming of 'GPT-5.4' and 'GPT-5.2' to avoid confusion with actual model releases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and rigor.

  1. Referee: [Experiments] The central performance comparison pits GPT-5.4 + OpenHands on SWE-EVO against GPT-5.2 on SWE-Bench Verified. This setup does not control for model version differences, leaving open whether the 25% vs 72.80% gap stems from the long-horizon nature of the tasks or from inherent differences between the models.

    Authors: We acknowledge that the model versions are not identical. However, GPT-5.4 is a newer release than GPT-5.2; if anything, this should bias results in favor of stronger performance on SWE-EVO. The large observed gap therefore provides evidence that long-horizon, multi-file evolution tasks remain challenging even for improved models. To strengthen the comparison, we will add results for GPT-5.4 on SWE-Bench Verified in the revised manuscript. revision: yes

  2. Referee: [Benchmark Construction] The paper lacks specific details on how release notes were filtered to create tasks, how task validity was ensured, and how partial success was quantified for the Fix Rate metric. These omissions undermine the ability to fully assess the benchmark's soundness and reproducibility.

    Authors: We agree that these details are essential. The revised manuscript will expand the benchmark construction section with explicit criteria for filtering release notes, the process used to ensure task validity (including test-suite executability and coverage checks), and a precise definition of Fix Rate together with its calculation and examples of partial-progress scoring. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark built from external sources with independent validation


The paper constructs SWE-EVO tasks directly from release notes and test suites of seven external open-source Python projects, then reports agent performance against the separately published SWE-Bench Verified benchmark. No equations, fitted parameters, self-definitions, or load-bearing self-citations appear in the derivation of the benchmark or the reported gap; all inputs are externally sourced and the central claim rests on empirical measurement rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that release-note-derived tasks faithfully represent long-horizon evolution and that test suites provide adequate validation; no free parameters or invented entities are introduced.


Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  2. SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

    cs.SE 2026-05 unverdicted novelty 7.0

    SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

  3. PBT-Bench: Benchmarking AI Agents on Property-Based Testing

    cs.SE 2026-05 unverdicted novelty 7.0

    PBT-Bench is a new benchmark of 100 property-based testing problems with 365 injected semantic bugs across 40 Python libraries that measures LLMs on deriving invariants and precise input-generation strategies.

  4. PBT-Bench: Benchmarking AI Agents on Property-Based Testing

    cs.SE 2026-05 conditional novelty 7.0

    PBT-Bench evaluates eight LLMs on 100 property-based testing tasks requiring derivation of invariants from docs and construction of targeted input generators to reveal 365 injected semantic bugs.

  5. Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution

    cs.SE 2026-05 unverdicted novelty 7.0

    TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.

  6. ProgramBench: Can Language Models Rebuild Programs From Scratch?

    cs.SE 2026-05 unverdicted novelty 7.0

    ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...

  7. Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems

    cs.AI 2026-04 unverdicted novelty 7.0

    A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.

  8. The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents

    cs.SE 2026-05 unverdicted novelty 5.0

    Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.

  9. More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems

    cs.SE 2026-04 unverdicted novelty 5.0

    AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.

  10. AI for Auto-Research: Roadmap & User Guide

    cs.AI 2026-05 unverdicted novelty 4.0

    The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
