Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software

arXiv: 2510.15494 · v2 · submitted 2025-10-17 · 💻 cs.SE · cs.AI · cs.PF

Lirong Yi, Gregory Gay, Philipp Leitner

Pith reviewed 2026-05-18 06:36 UTC · model grok-4.3

keywords: large language models · performance optimization · empirical study · Java software · JMH benchmarks · real-world code · code generation · volatility

The pith

LLMs can generate performance improvements for real Java production code but produce highly volatile results that lag human developers on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models can improve the speed of actual performance-critical sections of open-source Java systems rather than isolated algorithmic puzzles. It draws 65 concrete tasks from real projects, each paired with developer-written JMH microbenchmarks, and measures the speed gains LLMs propose against the changes human developers made. The results show that models frequently produce working speedups, yet these gains fluctuate sharply across repeated trials and remain smaller than human improvements in most cases. The authors conclude that benchmarks built only on clean algorithmic problems therefore give an inflated view of what LLMs can achieve in production settings. They locate the shortfall in two specific weaknesses: models rarely locate the actual slowdowns without help, and even when shown the right locations they seldom invent the most effective algorithmic fixes.

Core claim

Although LLMs demonstrate a surprisingly high capability to solve these complex engineering problems, their solutions suffer from extreme volatility and still lag behind human developers on average. Consequently, we find that the current benchmarks based on algorithmic tasks yields an overly optimistic assessment of LLM capabilities. We trace this real-world performance gap to two primary limitations: first, LLMs struggle to autonomously pinpoint performance hotspots, and second, even with explicit guidance, they often fall short of synthesizing optimal algorithmic improvements.

What carries the argument

A collection of 65 performance-critical tasks mined from open-source Java projects, each validated with original developer-written JMH benchmarks, used to compare LLM-proposed optimizations directly against human baselines.

If this is right

Algorithmic puzzle benchmarks overestimate how well LLMs handle performance work in production software.
LLMs need additional runtime observation capabilities to locate slowdowns reliably.
Even when hotspots are identified for them, LLMs still fall short of the most effective algorithmic changes humans make.
Performance improvement systems should shift from static code generation toward agent-based approaches that can profile and react to runtime behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Pairing LLMs with automated profiling tools could help close the gap in hotspot detection and reduce result volatility.
Repeating the study on performance-critical code in other languages could show whether the volatility and gap are Java-specific or general across languages.
A hybrid process in which humans flag hotspots and LLMs propose fixes might combine the strengths of both while avoiding full autonomy.
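To illustrate what "pairing with automated profiling tools" might look like in practice, here is a minimal sketch that profiles a toy workload and reports the function with the most self time. It uses Python's `cProfile` purely as a stand-in for a Java profiler; the paper studies Java, and the functions `slow_part`, `fast_part`, `workload`, and `find_hotspot` are hypothetical names invented for this example.

```python
import cProfile
import pstats


def slow_part(n=200_000):
    # Deliberately expensive loop: this is the hotspot we expect to find.
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_part():
    # Cheap work that should not dominate the profile.
    return 42


def workload():
    return slow_part() + fast_part()


def find_hotspot(func):
    """Profile func and return the name of the function with the most own time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    # stats maps (file, line, name) -> (call count, primitive calls, own time, cumulative time, callers)
    stats = pstats.Stats(profiler).stats
    (_, _, name), _ = max(stats.items(), key=lambda item: item[1][2])
    return name


hotspot = find_hotspot(workload)
print(hotspot)
```

On this toy workload the profile should point to `slow_part`. An agent-style system could feed that kind of result back into the prompt instead of asking the model to guess the hotspot from source code alone.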

Load-bearing premise

The 65 mined tasks from performance-critical open-source Java projects, together with the developer-written JMH benchmarks, provide a representative and rigorous basis for comparing LLM and human performance improvements in real production code.

What would settle it

A follow-up experiment on the same or a larger set of real Java tasks in which LLMs given runtime profiles produce speedups that consistently match or exceed human baselines with low variance across runs would falsify the reported performance gap and volatility.

Figures

Figures reproduced from arXiv: 2510.15494 by Gregory Gay, Lirong Yi, Philipp Leitner.

**Figure 1.** Overview of the research methods.

**Figure 2.** Automated patching with search/replace blocks generated by LLMs

**Figure 3.** Distribution of 𝑝𝑠𝑠 scores for LLM-generated solutions, aggregated across all tasks. The left plot groups results by prompting strategy, while the right plot groups results by model. The red dashed line indicates a 𝑝𝑠𝑠 score of 1.0 (i.e., performance equal to the original code). The blue dashed line indicates the median 𝑝𝑠𝑠 score of the developer’s solutions across all tasks. If no additional information i…

**Figure 4.** Heatmap of Vargha-Delaney A12 effect size from pairwise comparisons of all configurations (as well

**Figure 5.** A representative example of a strategy match, alignment, and divergence.

**Figure 6.** Categorization of the similarity of LLM-generated solutions to developer solutions, broken down by

**Figure 7.** Distribution of 𝑝𝑠𝑠 scores for each category, showing the performance impact of alignment. For visual clarity, outliers (identified using the IQR method [43]) are omitted from the boxplots
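The Figure 7 caption mentions dropping outliers with the IQR method. As a rough illustration, the sketch below applies the common Tukey rule, which flags points lying more than 1.5 × IQR outside the quartiles. The 1.5 multiplier, the quartile method, the function name `iqr_outliers`, and the sample scores are all assumptions made for this example; the caption does not state which variant the paper uses.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Split values into (kept, outliers) using Tukey's IQR fences."""
    # Quartiles via linear interpolation over the observed data.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


# Hypothetical scores: five ordinary results and one extreme result.
scores = [0.9, 1.0, 1.1, 1.2, 1.3, 9.5]
kept, outliers = iqr_outliers(scores)
print(kept, outliers)
```

Here only the extreme value 9.5 falls outside the fences. Hiding such points keeps a boxplot readable, but they still belong in any discussion of volatility.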

Original abstract

Large Language Models (LLMs) can generate code, but can they generate fast code for complex, real-world software systems? In this study, we investigate this question using a dataset of 65 tasks mined from performance-critical open-source Java projects. Unlike prior studies, which focused on algorithmic puzzles, we conduct experiments on actual performance-sensitive production code and employ developer-written JMH benchmarks to rigorously validate performance gains against human baselines. Our results reveal a nuanced reality -- although LLMs demonstrate a surprisingly high capability to solve these complex engineering problems, their solutions suffer from extreme volatility and still lag behind human developers on average. Consequently, we find that the current benchmarks based on algorithmic tasks yields an overly optimistic assessment of LLM capabilities. We trace this real-world performance gap to two primary limitations: first, LLMs struggle to autonomously pinpoint performance hotspots, and second, even with explicit guidance, they often fall short of synthesizing optimal algorithmic improvements. Our results highlight the need to move beyond static code generation towards more complex agent-based systems that are able to profile and observe runtime behavior for performance improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical study comparing LLM-proposed performance optimizations against human baselines across 65 tasks mined from performance-critical open-source Java projects. The authors employ developer-written JMH microbenchmarks to measure actual runtime gains, finding that LLMs show surprisingly high success rates on these complex engineering tasks yet produce highly volatile outputs that lag human developers on average. They conclude that this real-world gap implies current algorithmic-task benchmarks overestimate LLM capabilities, attributing the shortfall to difficulties in autonomous hotspot detection and synthesis of optimal algorithmic changes, and advocate for agent-based systems incorporating runtime profiling.

Significance. If the core empirical findings hold after addressing sampling and baseline issues, the work offers a timely corrective to the prevailing evaluation paradigm in LLM code generation research. By shifting from synthetic puzzles to production Java code with rigorous JMH validation, it provides concrete evidence that LLM performance claims may not generalize, underscoring the value of realistic benchmarks and motivating more sophisticated agentic approaches. The use of real projects and quantitative human comparisons is a clear methodological strength.

major comments (2)

[§3.1] §3.1 (Task Mining): The criteria used to select the 65 performance-critical tasks are insufficiently specified. It remains unclear how 'performance-critical' was operationalized (e.g., via profiling data, commit messages, or issue reports), what project diversity was achieved (number of repositories, application domains, sizes), and whether tasks were filtered or favored by the pre-existence of JMH benchmarks. These details are load-bearing for the central claim that algorithmic benchmarks are overly optimistic, as a non-representative sample could produce the observed volatility and lag without implying a systematic overestimation.
[§4.2] §4.2–4.3 (Human Baselines): The construction of the human performance baseline is ambiguous. The manuscript does not state whether the reported human improvements derive from the original developer commits in the repositories or from independent expert developers given identical prompts, constraints, and time limits as the LLMs. This distinction directly affects whether the average lag and volatility can be interpreted as evidence that algorithmic benchmarks are optimistic rather than an artifact of mismatched comparison conditions.

minor comments (2)

[Abstract] Abstract, line 8: 'yields' should be 'yield' to agree with the plural subject 'benchmarks'.
[Table 2] Table 2: The volatility metric (standard deviation of speedups) would benefit from an explicit definition or formula in the caption or methods, as readers must currently infer it from context.
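The definition requested in the Table 2 comment could be as simple as the following. This is a minimal sketch, assuming speedup means baseline time divided by optimized time and volatility means the sample standard deviation of speedups across repeated trials. Both definitions, the function names, and the timings are illustrative assumptions, not formulas taken from the paper.

```python
import statistics


def speedups(baseline_ns, trial_ns):
    """Speedup of each trial relative to the baseline (values above 1.0 are faster)."""
    return [baseline_ns / t for t in trial_ns]


def volatility(values):
    """Sample standard deviation of the speedups."""
    return statistics.stdev(values)


# Hypothetical ns/op measurements for three repeated LLM attempts on one task.
baseline = 100.0
trials = [50.0, 100.0, 200.0]

s = speedups(baseline, trials)
print(s, round(volatility(s), 4))
```

With these made-up numbers one attempt doubles the speed, one changes nothing, and one halves it, so the spread is large relative to the mean speedup.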

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. These have prompted us to clarify key methodological details. We respond to each major comment below and indicate the revisions we will make in the next version of the paper.

Referee: [§3.1] §3.1 (Task Mining): The criteria used to select the 65 performance-critical tasks are insufficiently specified. It remains unclear how 'performance-critical' was operationalized (e.g., via profiling data, commit messages, or issue reports), what project diversity was achieved (number of repositories, application domains, sizes), and whether tasks were filtered or favored by the pre-existence of JMH benchmarks. These details are load-bearing for the central claim that algorithmic benchmarks are overly optimistic, as a non-representative sample could produce the observed volatility and lag without implying a systematic overestimation.

Authors: We agree that §3.1 would benefit from greater specificity to support the generalizability of our findings. The tasks were mined by searching commit histories and linked issue reports in open-source Java repositories for mentions of performance optimizations or hotspots. In the revised manuscript we will explicitly state the operationalization (commit messages and issue reports), report the number of repositories and their domains (e.g., data processing, web infrastructure), provide summary statistics on project sizes, and confirm that pre-existing JMH benchmarks were a deliberate selection filter to enable rigorous runtime validation. We maintain that this sampling strategy targets authentic performance-engineering work rather than synthetic problems, thereby reinforcing rather than undermining the claim that algorithmic benchmarks overestimate LLM capabilities. revision: yes
Referee: [§4.2] §4.2–4.3 (Human Baselines): The construction of the human performance baseline is ambiguous. The manuscript does not state whether the reported human improvements derive from the original developer commits in the repositories or from independent expert developers given identical prompts, constraints, and time limits as the LLMs. This distinction directly affects whether the average lag and volatility can be interpreted as evidence that algorithmic benchmarks are optimistic rather than an artifact of mismatched comparison conditions.

Authors: The human baselines reported in §§4.2–4.3 are taken directly from the original developer commits that introduced the performance improvements in the studied repositories. These represent real, deployed human solutions rather than new experiments with independent experts under LLM-style constraints. We selected this comparison to evaluate LLM proposals against authentic production changes. In the revision we will add explicit wording in §§4.2–4.3 stating the source of the baselines and briefly discuss why this real-world reference is appropriate for assessing whether algorithmic benchmarks are overly optimistic. revision: yes

Circularity Check

0 steps flagged

Empirical study with no circular derivations or self-referential results

This is a straightforward empirical reporting study that mines 65 tasks from open-source Java projects, applies LLMs to propose performance improvements, and compares results against developer-written JMH benchmarks and human baselines. No mathematical derivations, first-principles predictions, fitted parameters renamed as outputs, or self-citation chains are present in the abstract or described methodology. The central claim that algorithmic benchmarks are overly optimistic is an interpretive conclusion drawn from direct experimental comparisons on the collected dataset rather than any reduction to inputs by construction. The study is self-contained as an observational report; representativeness concerns affect external validity but do not create circularity in the reported findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical investigation that relies on standard software-engineering experimental practices rather than new theoretical constructs or fitted parameters.

axioms (1)

Domain assumption: Developer-written JMH benchmarks provide an accurate and unbiased measure of performance differences.
Invoked when validating LLM suggestions against human baselines.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We collect a dataset of 65 real performance-improving changes from four large and performance-sensitive Java projects... employ developer-written JMH benchmarks to rigorously validate performance gains against human baselines."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We use a two-sided Wilcoxon signed-rank test... Vargha-Delaney A12 effect size"
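The second passage above names the Vargha-Delaney A12 effect size. For readers unfamiliar with it, the sketch below computes A12 from its standard definition: the probability that a value drawn from the first sample exceeds one drawn from the second, with ties counted as one half. The function name and the two score lists are hypothetical; this is not the paper's analysis script.

```python
def vargha_delaney_a12(xs, ys):
    """Probability that a random draw from xs exceeds one from ys (ties count half)."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    wins = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in xs
        for y in ys
    )
    return wins / (len(xs) * len(ys))


# Hypothetical per-task scores for an LLM configuration and for developers.
llm = [1.0, 1.2, 0.8, 1.1]
dev = [1.3, 1.5, 1.1, 1.4]

a12 = vargha_delaney_a12(llm, dev)
print(a12)
```

A value of 0.5 means neither group tends to score higher. Values near 0 or 1 indicate that one group almost always wins, which is how a heatmap of pairwise A12 values, as in Figure 4, is usually read.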

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits
cs.SE 2026-05 accept novelty 7.0

CppPerf-Mine produces CppPerf-DB, a benchmark of 347 real-world performance-improving C++ patches (39% multi-file) from 42 repositories to evaluate repository-level repair tools.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1] Sarah Abdulsalam, Ziliang Zong, Qijun Gu, and Meikang Qiu. 2015. Using the Greenup, Powerup, and Speedup metrics to evaluate software energy efficiency. In 2015 Sixth International Green and Sustainable Computing Conference (IGSC). 1–8. doi:10.1109/IGCC.2015.7393699

[2] Ali Abedi and Tim Brecht. 2017. Conducting Repeatable Experiments in Highly Variable Cloud Computing Environments. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering (L’Aquila, Italy) (ICPE ’17). Association for Computing Machinery, New York, NY, USA, 287–292. doi:10.1145/3030207.3030229

[3] Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. 2024. Large Language Models for Mathematical Reasoning: Progresses and Challenges. arXiv:2402.00157 [cs.CL] https://arxiv.org/abs/2402.00157

[4] Wajdi Aljedaani, Abdulrahman Habib, Ahmed Aljohani, Marcelo Eler, and Yunhe Feng. 2024. Does ChatGPT Generate Accessible Code? Investigating Accessibility Challenges in LLM-Generated Source Code. In Proceedings of the 21st International Web for All Conference (Singapore, Singapore) (W4A ’24). Association for Computing Machinery, New York, NY, USA, 165–176. doi:10.1145/3677846.3677854

[5] T. Ando, Chi-Kwong Li, and Roy Mathias. 2004. Geometric means. Linear Algebra Appl. 385 (2004), 305–334. doi:10.1016/j.laa.2003.11.019 Special Issue in honor of Peter Lancaster

[6] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732

[7] Sebastian Baltes and Paul Ralph. 2022. Sampling in software engineering research: a critical review and guidelines. Empirical Softw. Engg. 27, 4 (July 2022), 31 pages. doi:10.1007/s10664-021-10072-8

[8] Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. arXiv:2403.17134 [cs.SE] https://arxiv.org/abs/2403.17134

[9] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code Generation with Generated Tests. arXiv:2207.10397 [cs.CL] https://arxiv.org/abs/2207.10397

[10] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[11] Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. 2024. ChatUniTest: A Framework for LLM-Based Test Generation. In Companion Proceedings of the 32nd ACM International Conference on the Foundations ... doi:10.1145/3663529.3663801

[12] Tristan Coignion, Clément Quinton, and Romain Rouvoy. 2024. A Performance Study of LLM-Generated Code on Leetcode. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering (Salerno, Italy) (EASE ’24). Association for Computing Machinery, New York, NY, USA, 79–89. doi:10.1145/3661167.3661221

[13] Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I. Wang. 2024. CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution. arXiv:2401.03065 [cs.SE] https://arxiv.org/abs/2401.03065

[14] Xue Han and Tingting Yu. 2016. An Empirical Study on Performance Bugs for Highly Configurable Software Systems. In Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (Ciudad Real, Spain) (ESEM ’16). Association for Computing Machinery, New York, NY, USA, Article 23, 10 pages. doi:10.1145/2961111.2962602

[15] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS. arXiv:2105.09938 [cs.SE] https://arxiv.org/abs/2105.09938

[16] Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie M. Zhang. 2024. EffiBench: Benchmarking the Efficiency of Automatically Generated Code. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 11506–11544. https://proceedings....

[17] Baskhad Idrisov, Esther Eisenacher, and Tim Schlippe. 2025. Program Code Generation: Single LLMs vs. Multi-Agent Systems. In 2025 7th International Conference on Natural Language Processing (ICNLP). 121–127. doi:10.1109/ICNLP65360.2025.11108400

[18] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL] https://arxiv.org/abs/2310.06770

[19] Guoliang Jin, Linhai Song, Xiaoming Shi, Joel Scherpelz, and Shan Lu. 2012. Understanding and detecting real-world performance bugs. SIGPLAN Not. 47, 6 (June 2012), 77–88. doi:10.1145/2345156.2254075

[20] Ranim Khojah, Francisco Gomes Oliveira Neto, Mazen Mohamad, and Philipp Leitner. 2025. The Impact of Prompt Programming on Function-Level Code Generation. IEEE Transactions on Software Engineering (TSE) (2025)

[21] Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner. 2020. Dynamically reconfiguring software microbenchmarks: reducing execution time without sacrificing result quality. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Virtual Event, USA) (ES...

[22] Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated program repair. Commun. ACM 62, 12 (Nov. 2019), 56–65. doi:10.1145/3318162

[23] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. ... doi:10.1126/science.abq1158

[24] Bo Liu, Yanjie Jiang, Yuxia Zhang, Nan Niu, Guangjie Li, and Hui Liu. 2025. Exploring the potential of general purpose LLMs in automated software refactoring: an empirical study. Automated Software Engg. 32, 1 (March 2025), 42 pages. doi:10.1007/s10515-025-00500-0

[25] Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. 2024. Evaluating Language Models for Efficient Code Generation. arXiv:2408.06450 [cs.SE] https://arxiv.org/abs/2408.06450

[26] Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. arXiv:2306.03091 [cs.CL] https://arxiv.org/abs/2306.03091

[27] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an LLM to Help With Code Understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 97, 13 pages. doi:10.1145/3597503.3639187

[28] Changan Niu, Ting Zhang, Chuanyi Li, Bin Luo, and Vincent Ng. 2024. On Evaluating the Efficiency of Source Code Generated by LLMs. In Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and ... doi:10.1145/3650105.3652295

[29] Ali Nouri, Beatriz Cabrero-Daniel, Fredrik Törner, Håkan Sivencrona, and Christian Berger. 2024. Engineering Safety Requirements for Autonomous Driving with Large Language Models. In 2024 IEEE 32nd International Requirements Engineering Conference (RE). 218–228. doi:10.1109/RE59067.2024.00029

[30] Yun Peng, Akhilesh Deepak Gotmare, Michael R. Lyu, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. 2025. PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback. In 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge). 1–13. doi:10.1109/Forge66646.2025.00008

[32] Yun Peng, Jun Wan, Yichen Li, and Xiaoxue Ren. 2025. COFFE: A Code Efficiency Benchmark for Code Generation. Proc. ACM Softw. Eng. 2, FSE, Article FSE012 (June 2025), 24 pages. doi:10.1145/3715727

[33] Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis (Baltimore, MD, USA) (ISSTA 2015). Association for Computing Machinery, New York, NY, USA, 24–36. doi:10.1145/2771783.2771791

[34] Ruizhong Qiu, Weiliang Will Zeng, James Ezick, Christopher Lott, and Hanghang Tong. 2025. How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark. arXiv:2406.06647 [cs.SE] https://arxiv.org/abs/2406.06647

[35] Hazem Samoaa and Philipp Leitner. 2021. An Exploratory Study of the Impact of Parameterization on JMH Measurement Results in Open-Source Projects. In Proceedings of the ACM/SPEC International Conference on Performance Engineering (Virtual Event, France) (ICPE ’21). Association for Computing Machinery, New York, NY, USA, 213–224. doi:10.1145/3427921.3450243

[36] Advait Sarkar and Ian Drosos. 2025. Vibe coding: programming through conversation with artificial intelligence. arXiv:2506.23253 [cs.HC] https://arxiv.org/abs/2506.23253

[37] Atsushi Shirafuji, Yusuke Oda, Jun Suzuki, Makoto Morishita, and Yutaka Watanobe. 2023. Refactoring Programs Using Large Language Models with Few-Shot Examples. In 2023 30th Asia-Pacific Software Engineering Conference (APSEC). 151–160. doi:10.1109/APSEC60848.2023.00025

[38] S.E. Sim, S. Easterbrook, and R.C. Holt. 2003. Using benchmarking to advance research: a challenge to software engineering. In 25th International Conference on Software Engineering, 2003. Proceedings. 74–83. doi:10.1109/ICSE.2003.1201189

[39] John Smith. 1991. Performance engineering of software systems. Journal of the Operational Research Society 42, 10 (1991), 903–904

[40] Luca Traini, Vittorio Cortellessa, Daniele Di Pompeo, and Michele Tucci. 2023. Towards effective assessment of steady state performance in Java software: Are we there yet? Empirical Software Engineering 28, 1 (2023), 13. doi:10.1007/s10664-022-10247-x

[41] András Vargha and Harold D. Delaney. 2000. A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25, 2 (2000), 101–132. doi:10.3102/10769986025002101

[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proc...

[43]

Xiang Wan, Wenqian Wang, Jiming Liu, and Tiejun Tong. 2014. Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range.BMC medical research methodology14, 1 (2014), 135. doi:10.1186/1471-2288-14-135

work page doi:10.1186/1471-2288-14-135 2014
[44]

Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. 2020. Google Landmarks Dataset v2 - A Large-Scale Benchmark for Instance-Level Recognition and Retrieval. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

work page 2020
[45]

Frank Wilcoxon. 1945. Individual Comparisons by Ranking Methods.Biometrics Bulletin1, 6 (1945), pp. 80–83. http://www.jstor.org/stable/3001968

work page arXiv 1945
[46]

Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre- trained Language Models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1482–1494. doi:10.1109/ICSE48619.2023.00129

work page doi:10.1109/icse48619.2023.00129 2023
[47]

Nan Xu and Xuezhe Ma. 2025. LLM The Genius Paradox: A Linguistic and Math Expert’s Struggle with Simple Word-based Counting Problems. arXiv:2410.14166 [cs.CL] https://arxiv.org/abs/2410.14166 , Vol. 1, No. 1, Article . Publication date: October 2025. 22 Lirong Yi, Gregory Gay, and Philipp Leitner

work page arXiv 2025
[48]

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 50528–50652. https://proceedings.neurips.cc/paper_files/paper/2024/file/5a7c947568c1b1328ccc5230172e1e7...

[50]

Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE ’24). Association for Computing Machiner...

[51]

Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. arXiv:2303.12570 [cs.CL] https://arxiv.org/abs/2303.12570

[52]

Jiasheng Zheng, Boxi Cao, Zhengzhao Ma, Ruotong Pan, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. 2024. Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models. arXiv:2407.11470 [cs.SE] https://arxiv.org/abs/2407.11470

[53]

Li Zhong and Zilong Wang. 2024. Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation. Proceedings of the AAAI Conference on Artificial Intelligence 38, 19 (Mar. 2024), 21841–21849. doi:10.1609/aaai.v38i19.30185


