SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Dung Nguyen Manh; Huy Phan Nhat; Minh V. T. Thai; Nghi D. Q. Bui; Tue Le

arXiv: 2512.18470 · v5 · submitted 2025-12-20 · 💻 cs.SE · cs.AI · cs.MA


Pith reviewed 2026-05-16 20:21 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.MA

keywords: coding agents, software evolution, benchmark, long-horizon tasks, multi-file changes, SWE-EVO, performance evaluation, Fix Rate


The pith

Current AI coding agents achieve only 25 percent success on long-horizon software evolution tasks compared to 73 percent on single-issue benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWE-EVO, a benchmark that tests AI coding agents on realistic tasks drawn from actual software release notes. These tasks require interpreting high-level requirements and coordinating changes across an average of 21 files over multiple iterations while preserving overall functionality. Experiments show a sharp performance drop: GPT-5.4 with OpenHands solves only 25 percent of SWE-EVO instances, versus the 72.80 percent that GPT-5.2 reaches on the simpler SWE-Bench Verified. The authors also define Fix Rate, a metric that measures partial progress toward completing these complex tasks. The work argues that existing agents lack the sustained, multi-file reasoning needed for genuine software evolution.

Core claim

SWE-EVO shows that current agents struggle with the sustained multi-file reasoning required for long-horizon software evolution. The benchmark contains 48 tasks constructed from release notes of seven mature open-source Python projects; each task demands coordinated modifications spanning an average of 21 files and must pass test suites averaging 874 tests per instance. GPT-5.4 paired with OpenHands reaches only 25 percent success on SWE-EVO, while GPT-5.2 reaches 72.80 percent on SWE-Bench Verified. The paper also introduces Fix Rate to capture incremental progress on these harder tasks.
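The text here does not spell out how Fix Rate is computed, so the following is only an illustrative sketch of one plausible partial-progress score, not the paper's definition. It assumes a score equal to the fraction of initially failing target tests that pass after the agent's patch, set to zero if any previously passing test regresses. The function names, the `fail_to_pass` / `pass_to_pass` split (terminology borrowed from SWE-bench), and the regression rule are all hypothetical.

```python
from typing import Dict, Iterable


def fix_rate_sketch(
    results: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> float:
    """Hypothetical partial-progress score in [0, 1].

    results      -- test id -> True if the test passes after the agent's patch
    fail_to_pass -- tests that failed before the patch and should now pass
    pass_to_pass -- tests that passed before the patch and must keep passing
    """
    targets = list(fail_to_pass)
    if not targets:
        raise ValueError("need at least one target test")

    # Assumption (not from the paper): any regression forfeits partial credit.
    # A test missing from `results` is treated as failing.
    if any(not results.get(test_id, False) for test_id in pass_to_pass):
        return 0.0

    fixed = sum(1 for test_id in targets if results.get(test_id, False))
    return fixed / len(targets)


def resolved(
    results: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Binary success: every target test fixed and nothing broken."""
    return fix_rate_sketch(results, fail_to_pass, pass_to_pass) == 1.0
```

Under these assumptions, a patch that fixes two of three target tests without breaking anything scores two thirds, while the binary success measure counts the same instance as a failure. That difference is the kind of incremental progress a partial-credit metric is meant to expose.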

What carries the argument

The SWE-EVO benchmark: 48 tasks extracted from real release notes, each requiring multi-step modifications across an average of 21 files and validated against large test suites.

If this is right

Agents require improved mechanisms for tracking dependencies and coordinating changes across many files over multiple iterations.
Training on isolated single-file tasks does not transfer effectively to iterative codebase evolution.
Fix Rate offers a practical way to measure and reward partial progress on long-horizon tasks.
Future benchmarks should prioritize long-horizon scenarios to better match real development workflows.
The observed gap indicates that current agent designs miss key capabilities for preserving functionality while evolving codebases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent designs may need explicit planning layers that maintain a global view of file interactions across iterations.
The benchmark could be extended to non-Python projects to test whether the performance drop is language-specific.
Hybrid systems that combine agent reasoning with automated dependency analysis might close part of the observed gap.
Developers could use SWE-EVO-style tasks to evaluate whether an agent is ready for incremental production changes.

Load-bearing premise

Tasks extracted from release notes of seven mature projects accurately represent the distribution and difficulty of real-world long-horizon software evolution.

What would settle it

An agent architecture that reaches success rates above 70 percent on the full SWE-EVO suite without changes to its core design would falsify the claim of a fundamental limitation in sustained multi-file reasoning.


Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric capturing partial progress on these complex, long-horizon tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SWE-EVO offers a grounded long-horizon benchmark built from real release notes, but the 25% vs 72.80% gap compares different models (GPT-5.4 vs GPT-5.2), so it does not cleanly isolate horizon length as the cause.
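To put the headline numbers in scale, the short sketch below works through the arithmetic. It assumes the 25% figure is a resolved rate over all 48 SWE-EVO tasks; the helper names are hypothetical, and the task count and both rates are taken from the abstract.

```python
SWE_EVO_TASKS = 48        # benchmark size reported in the abstract
SWE_EVO_RATE = 0.25       # GPT-5.4 with OpenHands on SWE-EVO
VERIFIED_RATE = 0.7280    # GPT-5.2 on SWE-Bench Verified


def implied_solved(rate: float, n_tasks: int) -> float:
    """Number of tasks a resolved rate corresponds to."""
    return rate * n_tasks


def gap_points(high: float, low: float) -> float:
    """Difference between two rates, in percentage points."""
    return (high - low) * 100


def points_per_task(n_tasks: int) -> float:
    """Percentage points that a single task contributes to the rate."""
    return 100 / n_tasks


if __name__ == "__main__":
    print(round(implied_solved(SWE_EVO_RATE, SWE_EVO_TASKS)))      # tasks solved
    print(round(gap_points(VERIFIED_RATE, SWE_EVO_RATE), 1))       # gap in points
    print(round(points_per_task(SWE_EVO_TASKS), 2))                # points per task
```

On a 48-task benchmark each task is worth roughly two percentage points, so the reported rate corresponds to about a dozen solved instances. The gap is far larger than that granularity, but the arithmetic says nothing about how much of it comes from the difference in models rather than from task horizon.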


The paper's main contribution is SWE-EVO, a set of 48 tasks drawn from release notes of seven mature Python projects. Each task requires coordinated changes across an average of 21 files and is checked against test suites of roughly 874 tests. The authors also define Fix Rate to score partial progress instead of binary success. That setup moves past the single-issue focus of SWE-Bench and uses actual project histories, which is a practical step forward for agent evaluation.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SWE-EVO, a benchmark of 48 long-horizon software evolution tasks derived from release notes of seven mature Python projects. These tasks involve multi-step modifications across an average of 21 files and are validated using test suites averaging 874 tests. Experiments show GPT-5.4 with OpenHands achieving 25% success on SWE-EVO, in contrast to 72.80% by GPT-5.2 on SWE-Bench Verified, which highlights challenges in sustained multi-file reasoning. The manuscript also proposes the Fix Rate metric for partial progress.

Significance. Should the benchmark construction prove robust and the observed performance gap be attributable to task horizon rather than model or framework differences, this work would provide valuable evidence of limitations in current coding agents for realistic software evolution scenarios. The introduction of a partial progress metric like Fix Rate could aid in evaluating incremental improvements in agent capabilities.

major comments (2)

[Experiments] The central performance comparison pits GPT-5.4 + OpenHands on SWE-EVO against GPT-5.2 on SWE-Bench Verified. This setup does not control for model version differences, leaving open whether the 25% vs 72.80% gap stems from the long-horizon nature of the tasks or from inherent differences between the models.
[Benchmark Construction] The paper lacks specific details on how release notes were filtered to create tasks, how task validity was ensured, and how partial success was quantified for the Fix Rate metric. These omissions undermine the ability to fully assess the benchmark's soundness and reproducibility.

minor comments (1)

[Abstract] Clarify the exact versions or naming of 'GPT-5.4' and 'GPT-5.2' to avoid confusion with actual model releases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and rigor.


Referee: [Experiments] The central performance comparison pits GPT-5.4 + OpenHands on SWE-EVO against GPT-5.2 on SWE-Bench Verified. This setup does not control for model version differences, leaving open whether the 25% vs 72.80% gap stems from the long-horizon nature of the tasks or from inherent differences between the models.

Authors: We acknowledge that the model versions are not identical. However, GPT-5.4 is a newer release than GPT-5.2; if anything, this should bias results in favor of stronger performance on SWE-EVO. The large observed gap therefore provides evidence that long-horizon, multi-file evolution tasks remain challenging even for improved models. To strengthen the comparison, we will add results for GPT-5.4 on SWE-Bench Verified in the revised manuscript. revision: yes

Referee: [Benchmark Construction] The paper lacks specific details on how release notes were filtered to create tasks, how task validity was ensured, and how partial success was quantified for the Fix Rate metric. These omissions undermine the ability to fully assess the benchmark's soundness and reproducibility.

Authors: We agree that these details are essential. The revised manuscript will expand the benchmark construction section with explicit criteria for filtering release notes, the process used to ensure task validity (including test-suite executability and coverage checks), and a precise definition of Fix Rate together with its calculation and examples of partial-progress scoring. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark built from external sources with independent validation


The paper constructs SWE-EVO tasks directly from release notes and test suites of seven external open-source Python projects, then reports agent performance against the separately published SWE-Bench Verified benchmark. No equations, fitted parameters, self-definitions, or load-bearing self-citations appear in the derivation of the benchmark or the reported gap; all inputs are externally sourced and the central claim rests on empirical measurement rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that release-note-derived tasks faithfully represent long-horizon evolution and that test suites provide adequate validation; no free parameters or invented entities are introduced.


Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
cs.AI 2026-05 unverdicted novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
cs.SE 2026-05 unverdicted novelty 7.0

SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
cs.SE 2026-05 unverdicted novelty 7.0

PBT-Bench is a new benchmark of 100 property-based testing problems with 365 injected semantic bugs across 40 Python libraries that measures LLMs on deriving invariants and precise input-generation strategies.
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
cs.SE 2026-05 conditional novelty 7.0

PBT-Bench evaluates eight LLMs on 100 property-based testing tasks requiring derivation of invariants from docs and construction of targeted input generators to reveal 365 injected semantic bugs.
Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
cs.SE 2026-05 unverdicted novelty 7.0

TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
ProgramBench: Can Language Models Rebuild Programs From Scratch?
cs.SE 2026-05 unverdicted novelty 7.0

ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
cs.AI 2026-04 unverdicted novelty 7.0

A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
cs.SE 2026-05 unverdicted novelty 5.0

Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.
More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems
cs.SE 2026-04 unverdicted novelty 5.0

AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.
AI for Auto-Research: Roadmap & User Guide
cs.AI 2026-05 unverdicted novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 9 Pith papers · 25 internal anchors

[1]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents. arXiv preprint arXiv:2505.20411, 2025

work page arXiv 2025
[4]

Codetf: One-stop transformer library for state-of-the-art code llm.arXiv preprint arXiv:2306.00029, 2023

Nghi DQ Bui, Hung Le, Yue Wang, Junnan Li, Akhilesh Deepak Gotmare, and Steven CH Hoi. Codetf: One-stop transformer library for state-of-the-art code llm. arXiv preprint arXiv:2306.00029, 2023

work page arXiv 2023
[5]

Large language models as tool makers

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. In International Conference on Representation Learning, 2024

work page 2024
[6]

Cwm: An open-weights llm for research on code generation with world models.arXiv preprint arXiv:2510.02387, 2025

Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, et al. Cwm: An open-weights llm for research on code generation with world models. arXiv preprint arXiv:2510.02387, 2025

work page arXiv 2025
[7]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

CodeT: Code Generation with Generated Tests

Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. arXiv preprint arXiv:2207.10397, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021 b

work page internal anchor Pith review Pith/arXiv arXiv 2021
[11]

Teaching Large Language Models to Self-Debug

Xinyun Chen, Maxwell Lin, Nathanael Sch \"a rli, and Denny Zhou. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Devin ai, 2024

Cognition . Devin ai, 2024. https://cognition.ai/blog/introducing-devin

work page 2024
[13]

DeepSeek V3.1 , 2025

DeepSeek AI . DeepSeek V3.1 , 2025. https://api-docs.deepseek.com/news/news250821

work page 2025
[14]

State of ai-assisted software development: 2025 dora report

DORA Research Program . State of ai-assisted software development: 2025 dora report. https://cloud.google.com/resources/content/2025-dora-ai-assisted-software-development-report, September 2025. DevOps Research and Assessment (DORA)

work page 2025
[15]

Large language models for software engineering: Survey and open problems

Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pages 31--53. IEEE, 2023

work page 2023
[16]

The current challenges of software engineering in the era of large language models

Cuiyun Gao, Xing Hu, Shan Gao, Xin Xia, and Zhi Jin. The current challenges of software engineering in the era of large language models. ACM Transactions on Software Engineering and Methodology, 34 0 (5): 0 1--30, 2025

work page 2025
[17]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead

Junda He, Christoph Treude, and David Lo. Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead. ACM Transactions on Software Engineering and Methodology, 34 0 (5): 0 1--30, 2025

work page 2025
[19]

Measuring Coding Challenge Competence With APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[20]

CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

Anh Nguyen Hoang, Minh Le-Anh, Bach Le, and Nghi DQ Bui. Codewiki: Evaluating ai's ability to generate holistic documentation for large-scale codebases. arXiv preprint arXiv:2510.24428, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. Agentcoder: Multi-agent-based code generation with iterative testing and optimisation. arXiv preprint arXiv:2312.13010, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Testgeneval: A real world unit test generation and test completion benchmark

Kush Jain, Gabriel Synnaeve, and Baptiste Roziere. Testgeneval: A real world unit test generation and test completion benchmark. arXiv preprint arXiv:2410.00752, 2024

work page arXiv 2024
[23]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

work page 2024
[25]

A review on software maintenance issues and how to reduce maintenance efforts

Uttamjit Kaur and Gagandeep Singh. A review on software maintenance issues and how to reduce maintenance efforts. International Journal of Computer Applications, 118 0 (1): 0 6--11, 2015

work page 2015
[26]

Kimi K2: Open Agentic Intelligence

Kimi Team . Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Competition-level code generation with alphacode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R \'e mi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378 0 (6624): 0 1092--1097, 2022

work page 2022
[28]

Deepswe: Training a fully open-sourced, state-of-the-art coding agent by scaling rl, 2025

Michael Luo, Naman Jain, Jaskirat Singh, Sijun Tan, Ameen Patel, Qingyang Wu, Alpay Ariyak, Colin Cai, Shang Zhu Tarun Venkat, Ben Athiwaratkun, et al. Deepswe: Training a fully open-sourced, state-of-the-art coding agent by scaling rl, 2025

work page 2025
[29]

The vault: A comprehensive multilingual dataset for advancing code understanding and generation

Dung Nguyen Manh, Nam Le Hai, Anh TV Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, and Nghi DQ Bui. The vault: A comprehensive multilingual dataset for advancing code understanding and generation. arXiv preprint arXiv:2305.06156, 2023

work page arXiv 2023
[30]

Codemmlu: A multi-task benchmark for assessing code understanding & reasoning capabilities of codellms

Dung Manh Nguyen, Thang Chau Phan, Nam Le Hai, Tien-Thong Doan, Nam V Nguyen, Quang Pham, and Nghi DQ Bui. Codemmlu: A multi-task benchmark for assessing code understanding & reasoning capabilities of codellms. In The Thirteenth International Conference on Learning Representations, 2025 a

work page 2025
[31]

Agilecoder: Dynamic collaborative agents for software development based on agile methodology

Minh Huynh Nguyen, Thang Phan Chau, Phong X Nguyen, and Nghi DQ Bui. Agilecoder: Dynamic collaborative agents for software development based on agile methodology. In 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge), pages 156--167. IEEE, 2025 b

work page 2025
[32]

Gpt-4o system card, 2024 a

OpenAI . Gpt-4o system card, 2024 a . URL https://openai.com/index/hello-gpt-4o/

work page 2024
[33]

Swe-bench verified, 2024 b

OpenAI . Swe-bench verified, 2024 b . https://openai.com/index/introducing-swe-bench-verified/

work page 2024
[34]

Introducing gpt-4.1 in the api, 2025 a

OpenAI . Introducing gpt-4.1 in the api, 2025 a . URL https://openai.com/index/gpt-4-1/

work page 2025
[35]

Gpt-5 system card

OpenAI . Gpt-5 system card. Technical report, OpenAI, 2025 b . URL https://cdn.openai.com/gpt-5-system-card.pdf

work page 2025
[36]

Openai o3 and o4-mini system card

OpenAI . Openai o3 and o4-mini system card. Technical report, OpenAI, 2025 c . URL https://openai.com/index/o3-o4-mini-system-card/

work page 2025
[37]

Training Software Engineering Agents and Verifiers with SWE-Gym

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2024. URL https://arxiv.org/abs/2412.21139

work page internal anchor Pith review arXiv 2024
[38]

Swe-synth: Synthesizing verifiable bug-fix data to enable large language models in resolving real-world bugs

Minh VT Pham, Huy N Phan, Hoang N Phan, Cuong Le Chi, Tien N Nguyen, and Nghi DQ Bui. Swe-synth: Synthesizing verifiable bug-fix data to enable large language models in resolving real-world bugs. arXiv preprint arXiv:2504.14757, 2025

work page arXiv 2025
[39]

arXiv preprint arXiv:2409.16299

Huy Nhat Phan, Phong X. Nguyen, and Nghi D. Q. Bui. Hyperagent: Generalist software engineering agents to solve coding tasks at scale. CoRR, abs/2409.16299, 2024. doi:10.48550/ARXIV.2409.16299. URL https://doi.org/10.48550/arXiv.2409.16299

work page doi:10.48550/arxiv.2409.16299 2024
[40]

Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji

Cheng Qian, Chi Han, Yi R. Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models, 2024

work page 2024
[41]

Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025

Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, Xing Zhou, Dongrui Liu, Ling Yang, Yue Wu, Kaixuan Huang, Shilong Liu, Hongru Wang, and Mengdi Wang. Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025

work page 2025
[42]

Qwen3 Technical Report

Qwen Team . Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

A self-improving coding agent.arXiv preprint arXiv:2504.15228,

Maxime Robeyns, Martin Szummer, and Laurence Aitchison. A self-improving coding agent. arXiv preprint arXiv:2504.15228, 2025

work page arXiv 2025
[44]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Scale AI . Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Analysis of software maintenance cost affecting factors and estimation models

Chamkaur Singh, Neeraj Sharma, and Narender Kumar. Analysis of software maintenance cost affecting factors and estimation models. Int. J. Sci. Technol. Res, 8 0 (9): 0 276--281, 2019

work page 2019
[46]

Functional overlap reranking for neural code generation

Hung Quoc To, Minh Huynh Nguyen, and Nghi DQ Bui. Functional overlap reranking for neural code generation. arXiv preprint arXiv:2311.03366, 2023

work page arXiv 2023
[47]

Voyager: An open-ended embodied agent with large language models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024 a

work page 2024
[48]

Testeval: Benchmarking large language models for test case generation

Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. Testeval: Benchmarking large language models for test case generation. arXiv preprint arXiv:2406.04531, 2024 b

work page arXiv 2024
[49]

u rgen Schmidhuber. Huxley-g \

Wenyi Wang, Piotr Pi e kos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, and J \"u rgen Schmidhuber. Huxley-g \"o del machine: Human-level coding agent development by an approximation of the optimal self-improving machine, 2025

work page 2025
[50]

Executable code actions elicit better llm agents

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning, 2024 c

work page 2024
[51]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024 d

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922, 2023

work page internal anchor Pith review arXiv 2023
[53]

Trove: Inducing verifiable and efficient toolboxes for solving programmatic tasks, 2024 e

Zhiruo Wang, Daniel Fried, and Graham Neubig. Trove: Inducing verifiable and efficient toolboxes for solving programmatic tasks, 2024 e

work page 2024
[54]

Magicoder: Empowering code generation with oss-instruct.arXiv preprint arXiv:2312.02120, 2023

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120, 2023

work page arXiv 2023
[55]

Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang. Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

Conversational automated program repair,

Chunqiu Steven Xia and Lingming Zhang. Conversational automated program repair. arXiv preprint arXiv:2301.13246, 2023

work page arXiv 2023
[57]

Automated program repair via conversation: Fixing 162 out of 337 bugs for \ 0.42 each using chatgpt

Chunqiu Steven Xia and Lingming Zhang. Automated program repair via conversation: Fixing 162 out of 337 bugs for \ 0.42 each using chatgpt. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 819--831, 2024

work page 2024
[58]

Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

arXiv.org

John Yang, Carlos E Jimenez, Alex L Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R Narasimhan, et al. Swe-bench multimodal: Do ai systems generalize to visual software domains? arXiv preprint arXiv:2410.03859, 2024 b

work page arXiv 2024
[61]

Evaluating and improving chatgpt for unit test generation

Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. Evaluating and improving chatgpt for unit test generation. Proceedings of the ACM on Software Engineering, 1(FSE): 1703--1726, 2024

[62]

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, et al. Multi-swe-bench: A multilingual benchmark for issue resolving. arXiv preprint arXiv:2504.02605, 2025

[63]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025

[64]

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin godel machine: Open-ended evolution of self-improving agents. arXiv preprint arXiv:2505.22954, 2025a

[65]

SWE-bench Goes Live!

Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. Swe-bench goes live! arXiv preprint arXiv:2505.23419, 2025b

[66]

A survey on large language models for software engineering

Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen. A survey on large language models for software engineering. arXiv preprint arXiv:2312.15223, 2023

[67]

Autocoderover: Autonomous program improvement

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1592--1604, Vienna, Austria, 2024. ACM

[68]

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406, 2023

[69]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024

