JETO-Bench: A Reproducible Benchmark for Execution Time Improvement Patches in Java

Khashayar Etemadi; Zhendong Su

arxiv: 2606.31767 · v1 · pith:F4V2IZY7new · submitted 2026-06-30 · 💻 cs.SE

JETO-Bench: A Reproducible Benchmark for Execution Time Improvement Patches in Java

Khashayar Etemadi , Zhendong Su This is my paper

Pith reviewed 2026-07-01 04:08 UTC · model grok-4.3

classification 💻 cs.SE

keywords execution time improvement patchesJava benchmarksperformance bug fixingreproducible benchmarksautomated program repairGitHub miningDocker containersstatistical verification

0 comments

The pith

A configurable tool builds the first reusable benchmarks of execution time patches in real-world Java projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents JETO-Mine as a reusable tool that scans open-source Java repositories to locate patches improving execution time. Its pipeline applies user-defined filters and an LLM classifier in static analysis, then wraps candidates in Docker containers for dynamic statistical testing of time improvements, and supplies an evaluation harness for repair agents and tests. Applied across 11 years and nearly 1.8 million commits from 174 repositories, the tool yields JETO-Bench containing 660 identified and 91 manually verified executable cases. Unlike fixed datasets limited to other languages, JETO-Mine supports ongoing collection with custom configurations. Evaluation of one leading coding agent on the verified cases shows a 14.3 percent success rate and notes the general absence of performance tests in these projects.

Core claim

JETO-Mine is the first configurable and reusable tool for automatically creating reproducible benchmarks of execution time improvement patches in real-world Java projects. It employs a three-phase pipeline: static analysis that crawls GitHub repositories and identifies candidates using user-defined filters and an LLM-based issue classifier, dynamic analysis that wraps the identified patches in Docker images for fully reproducible execution and performs statistical testing to find objective evidence of execution time improvement, and an evaluation harness that enables quantitative assessment of both generated patches and generated tests. Using JETO-Mine to scan 11 years of development history

What carries the argument

JETO-Mine three-phase pipeline that identifies ETIPs via static analysis with LLM classification, verifies them through Docker-wrapped dynamic statistical testing, and supplies an evaluation harness.

If this is right

Researchers can reuse JETO-Mine with arbitrary filters and statistical thresholds to collect new benchmarks indefinitely.
The 91 cases allow quantitative evaluation of automated repair tools, with one agent succeeding on 13 of them.
Open-source Java projects rarely ship tests that capture execution time improvements.
The approach extends beyond fixed benchmarks by enabling continuous addition of new patches.
The lack of performance tests points to a gap that test-generation methods could address.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mining pipeline could be ported to other languages to produce comparable cross-language performance benchmarks.
The verified cases could be mined further to extract recurring patterns in how developers optimize Java execution time.
The benchmark supplies a concrete testbed for measuring how well repair agents generalize when the target language changes.
Future extensions might integrate automated generation of performance-focused tests directly into the harness.

Load-bearing premise

The LLM-based issue classifier in the static analysis phase and the subsequent manual verification step together produce a low rate of false-positive ETIPs, such that the 91 executable cases are representative of genuine execution time improvements.

What would settle it

Re-executing the dynamic analysis phase on the 91 cases and finding that a majority fail to show statistically significant execution time reductions would show the benchmark contains many non-genuine patches.

Figures

Figures reproduced from arXiv: 2606.31767 by Khashayar Etemadi, Zhendong Su.

**Figure 2.** Figure 2: The static analysis workflow in JETO-Mine. criteria ensure that the project is active and popular according to user definition, and uses a build tool that enables reproducibility. After the desired repositories are selected, JETO-Mine checks the commit diffs on the default branch over the timeline specified by the user (step 2). A commit diff is considered valid if it 1) only changes .java files, 2) does n… view at source ↗

**Figure 3.** Figure 3: The dynamic analysis phase of JETO-Mine. detects a statistically significant difference between the execution times of the two versions. The logs and statistical test results are presented to the user as the final report. In addition to patch evaluation, JETO-Mine also supports the evaluation of test generation tools. For this purpose, JETO-Mine takes a generated test class and runs it on the original and … view at source ↗

**Figure 4.** Figure 4: The number of identified ETIPs and executable [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 6.** Figure 6: The year in which the ETIPs in JETO-Bench are created. 5.2 Results for RQ2 (using JETO-Bench in practice) As explained in Section 4.4, before running OpenHands on JETOBench, we first manually check the executable ETIPs in JETOBench and their corresponding issues. We only keep the issues that are focused on execution time improvement and have informative description. The curated set contains 91 issues and… view at source ↗

**Figure 7.** Figure 7: Characteristics of Docker images built by [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Automated fixing of performance issues is gaining increasing attention. However, existing benchmarks of execution time improvement patches are fixed datasets that target Python, C++, or .NET and cannot be extended to new patches according to user-defined configurations. In this paper, we present JETO-Mine, the first configurable and reusable tool for automatically creating reproducible benchmarks of execution time improvement patches (ETIPs) in real-world Java projects. JETO-Mine employs a three-phase pipeline: a static analysis phase that crawls GitHub repositories and identifies ETIPs using user-defined filters and an LLM-based issue classifier, a dynamic analysis phase that wraps the identified ETIPs in Docker images for fully reproducible execution and performs statistical testing to find objective evidence of execution time improvement, and an evaluation harness that enables quantitative assessment of both generated patches and generated tests. Unlike existing benchmarks, JETO-Mine is designed as a reusable tool that allows researchers continuously collect new benchmarks with their own desired filters and statistical rigor levels. We use JETO-Mine to build JETO-Bench, a benchmark of 660 identified ETIPs and 91 manually verified executable ETIPs collected from 174 open-source Java repositories. To build JETO-Bench, JETO-Mine scans 11 years of open-source development history and nearly 1.8 million commits. We run OpenHands, a leading open-source coding agent, on the 91 manually verified executable ETIPs in JETO-Bench and find that it correctly fixes 14.3% (13/91) of the issues, aligning with results reported by similar studies on other programming languages. Our results also reveal that open-source Java projects largely lack tests that demonstrate execution time improvements, presenting an opportunity for future research in test generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

JETO-Mine gives a reusable Docker-based pipeline for mining and verifying Java performance patches, but the LLM classifier and manual verification steps lack any reported accuracy or agreement numbers.

read the letter

The paper's main contribution is JETO-Mine, a three-phase tool that crawls GitHub, applies user filters plus an LLM classifier to spot potential execution time patches, wraps candidates in Docker for reproducible runs, and applies statistical tests to confirm actual speedups. They used it on 1.8 million commits to produce JETO-Bench with 91 manually verified cases and then ran OpenHands on them for a 14.3% fix rate.

The reusable design and the Docker-plus-stats approach are the parts that work. Fixed benchmarks for other languages cannot grow with new data or different thresholds, so a configurable pipeline with explicit reproducibility steps fills a practical gap. The scale of the scan and the use of an external agent for the evaluation are straightforward and grounded in public data.

The soft spot is the middle of the pipeline. No precision or recall is given for the LLM classifier, and the manual verification has no inter-rater numbers or explicit rules for what counts as a confirmed time improvement after the dynamic test. If false positives are common, the 91-case set and the agent result lose their value for comparison.

This is for people who build or evaluate repair tools aimed at performance bugs in Java. Readers who need fresh, adaptable benchmark data or who care about mining reproducibility will find the pipeline description useful.

Send it to peer review. The core construction is worth checking in detail, and the gaps in reporting are the kind referees can address with concrete requests for numbers and protocols.

Referee Report

2 major / 1 minor

Summary. The paper introduces JETO-Mine, a configurable and reusable tool for mining reproducible benchmarks of execution time improvement patches (ETIPs) from real-world Java projects via a three-phase pipeline (static analysis with GitHub crawling, user filters, and LLM issue classifier; dynamic analysis with Docker containers and statistical testing; evaluation harness). Applied to 174 repositories and 1.8M commits, it yields JETO-Bench containing 660 identified ETIPs and 91 manually verified executable ones. Evaluation of the OpenHands agent on the 91 cases shows a 14.3% (13/91) fix rate, consistent with prior work on other languages, while noting the scarcity of performance-improvement tests in Java projects.

Significance. If the verification pipeline reliably isolates genuine ETIPs, the work supplies the first extensible Java benchmark in this space, enabling researchers to generate custom datasets with controllable rigor. The reported fix rate provides a comparable baseline across languages, and the identified gap in test coverage for performance patches is a concrete direction for follow-on research. The tool's design as reusable infrastructure (rather than a one-time dataset) is a notable strength.

major comments (2)

[Abstract and static analysis phase] Abstract and §3 (static analysis phase): the LLM-based issue classifier is central to producing the 660 candidates, yet no precision, recall, or validation metrics against a labeled subset are reported. Without these, it is impossible to bound the false-positive rate that propagates into the 91 verified cases.
[Abstract and dynamic analysis phase] Abstract and §4 (dynamic analysis and manual verification): the manuscript states that statistical testing yields 'objective evidence of execution time improvement' and that 91 cases are 'manually verified,' but supplies neither the exact statistical procedure/thresholds, inter-rater agreement statistics, nor explicit exclusion criteria. These details are load-bearing for the claim that JETO-Bench contains representative, genuine ETIPs.

minor comments (1)

[Abstract] The abstract could more explicitly state how many of the 660 candidates survived the dynamic statistical test versus how many were removed during manual verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that greater transparency on the LLM classifier validation and on the statistical/manual verification procedures is warranted. We respond point-by-point below and will incorporate the requested details in the revised manuscript.

read point-by-point responses

Referee: [Abstract and static analysis phase] Abstract and §3 (static analysis phase): the LLM-based issue classifier is central to producing the 660 candidates, yet no precision, recall, or validation metrics against a labeled subset are reported. Without these, it is impossible to bound the false-positive rate that propagates into the 91 verified cases.

Authors: We agree that precision, recall, and related metrics for the LLM-based issue classifier are necessary to quantify its contribution to the candidate pool and to bound downstream false positives. In the revised version we will add a dedicated evaluation subsection in §3 that reports results on a manually labeled subset (including precision, recall, F1, and a confusion matrix). This addition will directly address the concern. revision: yes
Referee: [Abstract and dynamic analysis phase] Abstract and §4 (dynamic analysis and manual verification): the manuscript states that statistical testing yields 'objective evidence of execution time improvement' and that 91 cases are 'manually verified,' but supplies neither the exact statistical procedure/thresholds, inter-rater agreement statistics, nor explicit exclusion criteria. These details are load-bearing for the claim that JETO-Bench contains representative, genuine ETIPs.

Authors: We concur that the exact statistical procedures, thresholds, exclusion criteria, and verification process must be documented. We will expand §4 with: (i) the precise statistical tests and thresholds applied (e.g., test name, p-value cutoff, minimum relative improvement), (ii) the full list of exclusion criteria used to arrive at the 91 executable cases, and (iii) a description of the manual verification workflow. If multiple raters participated we will add inter-rater agreement statistics; if verification was performed by a single author we will state this explicitly as a limitation. These clarifications will be added without altering the reported numbers. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark construction with no circular derivations

full rationale

The paper describes construction of JETO-Mine and JETO-Bench via crawling GitHub, LLM-based classification, Docker-wrapped dynamic statistical testing, and manual verification of 91 ETIPs. No equations, fitted parameters, predictions, or self-citations appear in the derivation chain. Central results (91 cases, 14.3% fix rate on OpenHands) rest on external public repositories and an independent agent, satisfying the self-contained benchmark criterion. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the reliability of the LLM classifier and statistical verification steps; no free parameters or invented entities are introduced.

axioms (2)

domain assumption An LLM can be used to classify GitHub issues as describing execution time improvements with sufficient accuracy for benchmark construction.
Invoked in the static analysis phase to filter commits.
domain assumption Docker-wrapped execution plus statistical testing provides objective evidence of execution time improvement.
Central to the dynamic analysis phase.

pith-pipeline@v0.9.1-grok · 5854 in / 1492 out tokens · 43729 ms · 2026-07-01T04:08:33.201784+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 18 canonical work pages · 4 internal anchors

[1]

Gabriel Aracena, Kyle Luster, Fabio Santos, Igor Steinmacher, and Marco A Gerosa
[2]

Applying large language models to issue classification: Revisiting with extended data and new models.Science of Computer Programming246 (2025), 103333

2025
[3]

Stephen M Blackburn, Robin Garner, Chris Hoffmann, Asjad M Khang, Kathryn S McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z Guyer, et al. 2006. The DaCapo benchmarks: Java benchmarking devel- opment and analysis. InProceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and a...

2006
[4]

Aymeric Blot and Justyna Petke. 2025. A comprehensive survey of benchmarks for improvement of software’s non-functional properties.Comput. Surveys57, 7 (2025), 1–36

2025
[5]

Tristan Coignion, Clément Quinton, and Romain Rouvoy. 2024. A performance study of llm-generated code on leetcode. InProceedings of the 28th international conference on evaluation and assessment in software engineering. 79–89

2024
[6]

Chris Cummins, Volker Seeker, Dejan Grubisic, Mostafa Elhoushi, Youwei Liang, Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Kim Hazelwood, Gabriel Syn- naeve, et al . 2023. Large language models for compiler optimization.arXiv preprint arXiv:2309.07062(2023)

work page arXiv 2023
[7]

Mingzhe Du, Luu A Tuan, Bin Ji, Qian Liu, and See-Kiong Ng. 2024. Mercury: A code efficiency benchmark for code large language models.Advances in Neural Information Processing Systems37 (2024), 16601–16622

2024
[8]

2016.IntroClassJava: A benchmark of 297 small and buggy Java programs

Thomas Durieux and Martin Monperrus. 2016.IntroClassJava: A benchmark of 297 small and buggy Java programs. Ph. D. Dissertation. Universite Lille 1

2016
[9]

Khashayar Etemadi, Nicolas Harrand, Simon Larsen, Haris Adzemovic, Henry Lu- ong Phu, Ashutosh Verma, Fernanda Madeiral, Douglas Wikström, and Martin Monperrus. 2022. Sorald: Automatic patch suggestions for sonarqube static analysis violations.IEEE Transactions on Dependable and Secure Computing20, 4 (2022), 2794–2810

2022
[10]

2026.Replication package.https://github

Khashayar Etemadi and Zhendong Su. 2026.Replication package.https://github. com/khesoem/JETO-Bench

2026
[11]

Zhang, Pinjia He, and Ahmed E

Zhiyuan Fan, Kirill Vasilevski, Dayi Lin, Boyuan Chen, Yihao Chen, Zhiqing Zhong, Jie M. Zhang, Pinjia He, and Ahmed E. Hassan. 2025. SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Con- straints.ArXivabs/2509.09853 (2025). https://api.semanticscholar.org/CorpusID: 281310248

work page arXiv 2025
[12]

Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung
[13]

InProceedings of the 30th ACM joint european software engineering conference and symposium on the foundations of software engineering

Vulrepair: a t5-based automated software vulnerability repair. InProceedings of the 30th ACM joint european software engineering conference and symposium on the foundations of software engineering. 935–947
[14]

Qing Gao, Yingfei Xiong, Yaqing Mi, Lu Zhang, Weikun Yang, Zhaoping Zhou, Bing Xie, and Hong Mei. 2015. Safe memory-leak fixing for c programs. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE, 459–470

2015
[15]

Shuzheng Gao, Cuiyun Gao, Wenchao Gu, and Michael R. Lyu. 2025.Search- Based LLMs for Code Optimization. IEEE Press, 578–590. https://doi.org/10.1109/ ICSE55347.2025.00021

work page arXiv 2025
[16]

Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B Clement, Neel Sun- daresan, and Chen Wu. 2022. Deepdev-perf: a deep learning-based approach for improving software performance. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 948–958

2022
[17]

Spandan Garg, Roshanak Zilouchian Moghaddam, and Neel Sundaresan. 2025. PerfBench: Can Agents Resolve Real-World Performance Bugs?arXiv preprint arXiv:2509.24091(2025)

work page arXiv 2025
[18]

Spandan Garg, Roshanak Zilouchian Moghaddam, and Neel Sundaresan. 2025. Rapgen: An approach for fixing code inefficiencies in zero-shot. In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 124–135

2025
[19]

Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. Statistically rigorous java performance evaluation.ACM SIGPLAN Notices42, 10 (2007), 57–76

2007
[20]

Mohammadreza Ghanavati, Diego Costa, Janos Seboek, David Lo, and Artur Andrzejak. 2020. Memory and resource leak defects and their repairs in Java projects.Empirical Software Engineering25, 1 (2020), 678–718

2020
[21]

Jingzhi Gong, Vardan Voskanyan, Paul Brookes, Fan Wu, Wei Jie, Jie Xu, Rafail Giavrimis, Mike Basios, Leslie Kanthan, and Zheng Wang. 2025. Language models for code optimization: Survey, challenges and future directions.arXiv preprint arXiv:2501.01277(2025)

work page arXiv 2025
[22]

Xinyi He, Qian Liu, Mingzhe Du, Lin Yan, Zhijie Fan, Yiming Huang, Zejian Yuan, and Zejun Ma. 2025. Swe-perf: Can language models optimize code performance on real-world repositories?arXiv preprint arXiv:2507.12415(2025)

work page arXiv 2025
[23]

Dong HUANG, Jianbo Dai, Han Weng, Puzhen Wu, Yuhao QING, Heming Cui, Zhijiang Guo, and Jie Zhang. 2024. EffiLearner: Enhancing Efficiency of Generated Code via Self-Optimization. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=KhwOuB0fs9

2024
[24]

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie M Zhang. 2024. Ef- fibench: Benchmarking the efficiency of automatically generated code.Advances in Neural Information Processing Systems37 (2024), 11506–11544

2024
[25]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can Language Models Resolve Real- World GitHub Issues?ArXivabs/2310.06770 (2023). https://api.semanticscholar. org/CorpusID:263829697

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of ex- isting faults to enable controlled testing studies for Java programs. InProceedings of the 2014 international symposium on software testing and analysis. 437–440

2014
[27]

Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge. InProceedings Companion of the 2017 ACM SIGPLAN international conference on systems, programming, languages, and applications: software for humanity. 55–56

2017
[28]

Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. 2024. Evaluating language models for efficient code generation.arXiv preprint arXiv:2408.06450(2024)

work page arXiv 2024
[29]

Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, and Parthasarathy Ranganathan. 2025. SWE- fficiency: Can Language Models Optimize Real-World Repositories on Real Work- loads?arXiv preprint arXiv:2511.06090(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Aman Madaan, Alexander Shypula, Uri Alon, Milad Hashemi, Parthasarathy Ranganathan, Yiming Yang, Graham Neubig, and Amir Yazdanbakhsh. 2023. Learning Performance-Improving Code Edits.ArXivabs/2302.07867 (2023). https: //api.semanticscholar.org/CorpusID:256868633

work page arXiv 2023
[31]

Fernanda Madeiral, Simon Urli, Marcelo Maia, and Martin Monperrus. 2019. Bears: An extensible java bug benchmark for automatic program repair studies. In2019 IEEE 26th international conference on software analysis, evolution and reengineering (SANER). IEEE, 468–478

2019
[32]

Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher Re, and Azalia Mirhoseini. 2025. KernelBench: Can LLMs Write Efficient GPU Kernels?. InForty-second International Conference on Machine Learning. https: //openreview.net/forum?id=yeoN1iQT1x

2025
[33]

Huiyun Peng, Antonio Zhong, Ricardo Andrés Calvo Méndez, Kelechi G Kalu, and James C Davis. 2025. How Do Agents Perform Code Optimization? An Empirical Study.arXiv preprint arXiv:2512.21757(2025)

work page arXiv 2025
[34]

Xiaoxue Ren, Jun Wan, Yun Peng, Zhongxin Liu, Ming Liang, Dajun Chen, Wei Jiang, and Yong Li. 2025. PEACE: Towards Efficient Project-Level Efficiency Optimization via Hybrid Code Editing.arXiv preprint arXiv:2510.17142(2025)

work page arXiv 2025
[35]

Ripon K Saha, Yingjun Lyu, Wing Lam, Hiroaki Yoshida, and Mukul R Prasad. 2018. Bugs. jar: A large-scale, diverse dataset of real-world java bugs. InProceedings of the 15th international conference on mining software repositories. 10–13

2018
[36]

Shaunak Samant. 2025. Syntax Is Not Enough: An Empirical Study of Small Transformer Models for Neural Code Repair.arXiv preprint arXiv:2512.22216 (2025)

work page arXiv 2025
[37]

Marija Selakovic and Michael Pradel. 2015. Poster: Automatically fixing real- world JavaScript performance bugs. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 2. IEEE, 811–812

2015
[38]

Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, and Ion Stoica. 2025. Gso: Challenging software optimization tasks for evaluating swe-agents.arXiv preprint arXiv:2505.23671(2025)

work page arXiv 2025
[39]

Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, and Aditya Kanade. 2024. Nofuneval: Funny how code lms falter on requirements beyond functional correctness.arXiv preprint arXiv:2401.15963(2024)

work page arXiv 2024
[40]

Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation.ACM Transactions on Software Engineering and Methodology (TOSEM)28, 4 (2019), 1–29

2019
[41]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...

2025
[42]

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. De- mystifying LLM-Based Software Engineering Agents.Proc. ACM Softw. Eng.2, FSE, Article FSE037 (June 2025), 24 pages. doi:10.1145/3715754

work page doi:10.1145/3715754 2025
[43]

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems37 (2024), 50528–50652

2024
[44]

Jiuding Yang, Shengyao Lu, Hongxuan Liu, Shayan Shirahmad Gale Bagi, Zahra Fazel, Tomasz Czajkowski, and Di Niu. 2025. PerfCoder: Large Language Models for Interpretable Code Performance Optimization.arXiv preprint arXiv:2512.14018 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

He Ye, Matias Martinez, and Martin Monperrus. 2022. Neural program repair with execution-based backpropagation. InProceedings of the 44th international conference on software engineering. 1506–1518. Conference’17, July 2017, Washington, DC, USA Khashayar Etemadi and Zhendong Su

2022
[46]

Lirong Yi, Gregory Gay, and Philipp Leitner. 2025. An Experimental Study of Real- Life LLM-Proposed Performance Improvements.arXiv preprint arXiv:2510.15494 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen
[48]

A survey of learning-based automated program repair.ACM Transactions on Software Engineering and Methodology33, 2 (2023), 1–69

2023
[49]

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Au- tocoderover: Autonomous program improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604

2024

[1] [1]

Gabriel Aracena, Kyle Luster, Fabio Santos, Igor Steinmacher, and Marco A Gerosa

[2] [2]

Applying large language models to issue classification: Revisiting with extended data and new models.Science of Computer Programming246 (2025), 103333

2025

[3] [3]

Stephen M Blackburn, Robin Garner, Chris Hoffmann, Asjad M Khang, Kathryn S McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z Guyer, et al. 2006. The DaCapo benchmarks: Java benchmarking devel- opment and analysis. InProceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and a...

2006

[4] [4]

Aymeric Blot and Justyna Petke. 2025. A comprehensive survey of benchmarks for improvement of software’s non-functional properties.Comput. Surveys57, 7 (2025), 1–36

2025

[5] [5]

Tristan Coignion, Clément Quinton, and Romain Rouvoy. 2024. A performance study of llm-generated code on leetcode. InProceedings of the 28th international conference on evaluation and assessment in software engineering. 79–89

2024

[6] [6]

Chris Cummins, Volker Seeker, Dejan Grubisic, Mostafa Elhoushi, Youwei Liang, Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Kim Hazelwood, Gabriel Syn- naeve, et al . 2023. Large language models for compiler optimization.arXiv preprint arXiv:2309.07062(2023)

work page arXiv 2023

[7] [7]

Mingzhe Du, Luu A Tuan, Bin Ji, Qian Liu, and See-Kiong Ng. 2024. Mercury: A code efficiency benchmark for code large language models.Advances in Neural Information Processing Systems37 (2024), 16601–16622

2024

[8] [8]

2016.IntroClassJava: A benchmark of 297 small and buggy Java programs

Thomas Durieux and Martin Monperrus. 2016.IntroClassJava: A benchmark of 297 small and buggy Java programs. Ph. D. Dissertation. Universite Lille 1

2016

[9] [9]

Khashayar Etemadi, Nicolas Harrand, Simon Larsen, Haris Adzemovic, Henry Lu- ong Phu, Ashutosh Verma, Fernanda Madeiral, Douglas Wikström, and Martin Monperrus. 2022. Sorald: Automatic patch suggestions for sonarqube static analysis violations.IEEE Transactions on Dependable and Secure Computing20, 4 (2022), 2794–2810

2022

[10] [10]

2026.Replication package.https://github

Khashayar Etemadi and Zhendong Su. 2026.Replication package.https://github. com/khesoem/JETO-Bench

2026

[11] [11]

Zhang, Pinjia He, and Ahmed E

Zhiyuan Fan, Kirill Vasilevski, Dayi Lin, Boyuan Chen, Yihao Chen, Zhiqing Zhong, Jie M. Zhang, Pinjia He, and Ahmed E. Hassan. 2025. SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Con- straints.ArXivabs/2509.09853 (2025). https://api.semanticscholar.org/CorpusID: 281310248

work page arXiv 2025

[12] [12]

Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung

[13] [13]

InProceedings of the 30th ACM joint european software engineering conference and symposium on the foundations of software engineering

Vulrepair: a t5-based automated software vulnerability repair. InProceedings of the 30th ACM joint european software engineering conference and symposium on the foundations of software engineering. 935–947

[14] [14]

Qing Gao, Yingfei Xiong, Yaqing Mi, Lu Zhang, Weikun Yang, Zhaoping Zhou, Bing Xie, and Hong Mei. 2015. Safe memory-leak fixing for c programs. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE, 459–470

2015

[15] [15]

Shuzheng Gao, Cuiyun Gao, Wenchao Gu, and Michael R. Lyu. 2025.Search- Based LLMs for Code Optimization. IEEE Press, 578–590. https://doi.org/10.1109/ ICSE55347.2025.00021

work page arXiv 2025

[16] [16]

Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B Clement, Neel Sun- daresan, and Chen Wu. 2022. Deepdev-perf: a deep learning-based approach for improving software performance. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 948–958

2022

[17] [17]

Spandan Garg, Roshanak Zilouchian Moghaddam, and Neel Sundaresan. 2025. PerfBench: Can Agents Resolve Real-World Performance Bugs?arXiv preprint arXiv:2509.24091(2025)

work page arXiv 2025

[18] [18]

Spandan Garg, Roshanak Zilouchian Moghaddam, and Neel Sundaresan. 2025. Rapgen: An approach for fixing code inefficiencies in zero-shot. In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 124–135

2025

[19] [19]

Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. Statistically rigorous java performance evaluation.ACM SIGPLAN Notices42, 10 (2007), 57–76

2007

[20] [20]

Mohammadreza Ghanavati, Diego Costa, Janos Seboek, David Lo, and Artur Andrzejak. 2020. Memory and resource leak defects and their repairs in Java projects.Empirical Software Engineering25, 1 (2020), 678–718

2020

[21] [21]

Jingzhi Gong, Vardan Voskanyan, Paul Brookes, Fan Wu, Wei Jie, Jie Xu, Rafail Giavrimis, Mike Basios, Leslie Kanthan, and Zheng Wang. 2025. Language models for code optimization: Survey, challenges and future directions.arXiv preprint arXiv:2501.01277(2025)

work page arXiv 2025

[22] [22]

Xinyi He, Qian Liu, Mingzhe Du, Lin Yan, Zhijie Fan, Yiming Huang, Zejian Yuan, and Zejun Ma. 2025. Swe-perf: Can language models optimize code performance on real-world repositories?arXiv preprint arXiv:2507.12415(2025)

work page arXiv 2025

[23] [23]

Dong HUANG, Jianbo Dai, Han Weng, Puzhen Wu, Yuhao QING, Heming Cui, Zhijiang Guo, and Jie Zhang. 2024. EffiLearner: Enhancing Efficiency of Generated Code via Self-Optimization. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=KhwOuB0fs9

2024

[24] [24]

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie M Zhang. 2024. Ef- fibench: Benchmarking the efficiency of automatically generated code.Advances in Neural Information Processing Systems37 (2024), 11506–11544

2024

[25] [25]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can Language Models Resolve Real- World GitHub Issues?ArXivabs/2310.06770 (2023). https://api.semanticscholar. org/CorpusID:263829697

work page internal anchor Pith review Pith/arXiv arXiv 2023

[26] [26]

René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of ex- isting faults to enable controlled testing studies for Java programs. InProceedings of the 2014 international symposium on software testing and analysis. 437–440

2014

[27] [27]

Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge. InProceedings Companion of the 2017 ACM SIGPLAN international conference on systems, programming, languages, and applications: software for humanity. 55–56

2017

[28] [28]

Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. 2024. Evaluating language models for efficient code generation.arXiv preprint arXiv:2408.06450(2024)

work page arXiv 2024

[29] [29]

Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, and Parthasarathy Ranganathan. 2025. SWE- fficiency: Can Language Models Optimize Real-World Repositories on Real Work- loads?arXiv preprint arXiv:2511.06090(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Aman Madaan, Alexander Shypula, Uri Alon, Milad Hashemi, Parthasarathy Ranganathan, Yiming Yang, Graham Neubig, and Amir Yazdanbakhsh. 2023. Learning Performance-Improving Code Edits.ArXivabs/2302.07867 (2023). https: //api.semanticscholar.org/CorpusID:256868633

work page arXiv 2023

[31] [31]

Fernanda Madeiral, Simon Urli, Marcelo Maia, and Martin Monperrus. 2019. Bears: An extensible java bug benchmark for automatic program repair studies. In2019 IEEE 26th international conference on software analysis, evolution and reengineering (SANER). IEEE, 468–478

2019

[32] [32]

Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher Re, and Azalia Mirhoseini. 2025. KernelBench: Can LLMs Write Efficient GPU Kernels?. InForty-second International Conference on Machine Learning. https: //openreview.net/forum?id=yeoN1iQT1x

2025

[33] [33]

Huiyun Peng, Antonio Zhong, Ricardo Andrés Calvo Méndez, Kelechi G Kalu, and James C Davis. 2025. How Do Agents Perform Code Optimization? An Empirical Study.arXiv preprint arXiv:2512.21757(2025)

work page arXiv 2025

[34] [34]

Xiaoxue Ren, Jun Wan, Yun Peng, Zhongxin Liu, Ming Liang, Dajun Chen, Wei Jiang, and Yong Li. 2025. PEACE: Towards Efficient Project-Level Efficiency Optimization via Hybrid Code Editing.arXiv preprint arXiv:2510.17142(2025)

work page arXiv 2025

[35] [35]

Ripon K Saha, Yingjun Lyu, Wing Lam, Hiroaki Yoshida, and Mukul R Prasad. 2018. Bugs. jar: A large-scale, diverse dataset of real-world java bugs. InProceedings of the 15th international conference on mining software repositories. 10–13

2018

[36] [36]

Shaunak Samant. 2025. Syntax Is Not Enough: An Empirical Study of Small Transformer Models for Neural Code Repair.arXiv preprint arXiv:2512.22216 (2025)

work page arXiv 2025

[37] [37]

Marija Selakovic and Michael Pradel. 2015. Poster: Automatically fixing real- world JavaScript performance bugs. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 2. IEEE, 811–812

2015

[38] [38]

Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, and Ion Stoica. 2025. Gso: Challenging software optimization tasks for evaluating swe-agents.arXiv preprint arXiv:2505.23671(2025)

work page arXiv 2025

[39] [39]

Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, and Aditya Kanade. 2024. Nofuneval: Funny how code lms falter on requirements beyond functional correctness.arXiv preprint arXiv:2401.15963(2024)

work page arXiv 2024

[40] [40]

Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation.ACM Transactions on Software Engineering and Methodology (TOSEM)28, 4 (2019), 1–29

2019

[41] [41]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...

2025

[42] [42]

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. De- mystifying LLM-Based Software Engineering Agents.Proc. ACM Softw. Eng.2, FSE, Article FSE037 (June 2025), 24 pages. doi:10.1145/3715754

work page doi:10.1145/3715754 2025

[43] [43]

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems37 (2024), 50528–50652

2024

[44] [44]

Jiuding Yang, Shengyao Lu, Hongxuan Liu, Shayan Shirahmad Gale Bagi, Zahra Fazel, Tomasz Czajkowski, and Di Niu. 2025. PerfCoder: Large Language Models for Interpretable Code Performance Optimization.arXiv preprint arXiv:2512.14018 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

He Ye, Matias Martinez, and Martin Monperrus. 2022. Neural program repair with execution-based backpropagation. InProceedings of the 44th international conference on software engineering. 1506–1518. Conference’17, July 2017, Washington, DC, USA Khashayar Etemadi and Zhendong Su

2022

[46] [46]

Lirong Yi, Gregory Gay, and Philipp Leitner. 2025. An Experimental Study of Real- Life LLM-Proposed Performance Improvements.arXiv preprint arXiv:2510.15494 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

[48] [48]

A survey of learning-based automated program repair.ACM Transactions on Software Engineering and Methodology33, 2 (2023), 1–69

2023

[49] [49]

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Au- tocoderover: Autonomous program improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604

2024