Evaluating LLMs on Real-World Software Performance Optimization
Pith reviewed 2026-06-25 20:07 UTC · model grok-4.3
The pith
Current LLMs produce negligible runtime gains and almost no memory reductions on real repository tasks, while experts achieve 15.5x speedups and 171.3x peak-memory cuts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The SWE-Pro evaluation finds that LLMs achieve negligible runtime gains and almost no memory improvements across the 102 tasks. The original expert patches, by contrast, produce an aggregate 15.5x speedup and a 171.3x peak-memory reduction, improving runtime in 91.2 percent of tasks and peak memory in 65.7 percent.
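As a quick sanity check on the reported percentages, the sketch below converts them into task counts. It assumes each percentage is a simple fraction of the 102 tasks; the summary does not state raw counts, so the helper names `implied_task_count` and `as_percent` and the counts they produce are an inference, not reported data.

```python
TOTAL_TASKS = 102  # benchmark size stated in the summary


def implied_task_count(percent, total=TOTAL_TASKS):
    """Nearest whole number of tasks matching a reported percentage.

    Assumption: the percentage is a plain fraction of `total`.
    """
    return round(total * percent / 100)


def as_percent(count, total=TOTAL_TASKS):
    """Convert a task count back to a percentage with one decimal place."""
    return round(100 * count / total, 1)


for metric, percent in [("runtime", 91.2), ("peak memory", 65.7)]:
    count = implied_task_count(percent)
    print(f"{metric}: about {count} of {TOTAL_TASKS} tasks ({as_percent(count)}%)")
```

Under that assumption, the reported figures are consistent with 93 tasks improving in runtime and 67 improving in peak memory.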
What carries the argument
The SWE-Pro benchmark carries the argument: it pairs each optimization task with parameterized tests and a noise-aware protocol for measuring runtime, peak memory, and Time-Weighted Memory Usage (TWMU).
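To make the three metrics concrete, here is a minimal Python sketch of how a parameterized workload might be measured. It is not the SWE-Pro harness: `build_list`, the repeat count, and the input sizes are hypothetical, and because this summary does not define TWMU, `time_weighted_memory` uses one plausible reading (the trapezoidal integral of memory over time, in byte-seconds) purely as an assumption.

```python
import time
import tracemalloc
from statistics import median


def measure_runtime(workload, *args, repeats=5):
    """Median wall-clock seconds over several runs (tracing disabled)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload(*args)
        samples.append(time.perf_counter() - start)
    return median(samples)


def measure_peak_memory(workload, *args):
    """Peak bytes allocated by Python objects during one run."""
    tracemalloc.start()
    try:
        workload(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def time_weighted_memory(samples):
    """ASSUMED TWMU: trapezoidal integral over (seconds, bytes) samples."""
    total = 0.0
    for (t0, m0), (t1, m1) in zip(samples, samples[1:]):
        total += (t1 - t0) * (m0 + m1) / 2
    return total


def build_list(n):
    """Hypothetical workload whose cost grows with the input size n."""
    return [i * i for i in range(n)]


if __name__ == "__main__":
    # Parameterized evaluation: the same workload across several input sizes.
    for n in (1_000, 10_000, 100_000):
        seconds = measure_runtime(build_list, n)
        peak = measure_peak_memory(build_list, n)
        print(f"n={n}: {seconds:.6f} s, peak {peak} bytes")
```

Runtime and memory are measured in separate runs here because `tracemalloc` adds overhead that would otherwise inflate the timing.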
If this is right
- LLMs currently cannot replace expert engineers on repository-level performance work.
- Single-function or single-metric benchmarks miss the trade-offs and noise that dominate real optimization.
- Progress on LLM code agents will require benchmarks that enforce multi-metric, multi-input evaluation under realistic measurement conditions.
- Expert patches remain the only reliable source of large performance wins on these tasks.
Where Pith is reading between the lines
- Models may need explicit exposure to profiling data and memory-layout reasoning before they can close the observed gap.
- Future benchmarks could add cost models that penalize both time and memory simultaneously rather than treating them separately.
- If SWE-Pro tasks are added to training corpora, measured gains on the benchmark itself would need to be checked against held-out projects to guard against overfitting.
Load-bearing premise
The 102 expert-written optimizations collected from open-source projects form a representative proxy for the full complexity of real-world repository-level performance optimization.
What would settle it
Running the same LLMs on the SWE-Pro tasks and obtaining aggregate speedups and peak-memory reductions comparable to the expert baselines of 15.5x and 171.3x would overturn the claim.
Original abstract
Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks that capture how optimization actually happens in real-world codebases. Existing frameworks often oversimplify the problem by focusing on isolated functions or a single performance metric, missing the critical trade-offs between execution time and memory footprint, the inherent noise of the measurement environment, and the variability introduced by different input data and execution conditions. We address this by introducing SWE-Pro, a repository-level benchmark derived from 102 expert-written optimizations from open-source projects. Unlike previous benchmarks, SWE-Pro pairs each task with parameterized tests to evaluate runtime, peak memory, and Time-Weighted Memory Usage (TWMU) across varying input data and execution conditions under noise-aware measurement conditions. Our evaluation shows that current LLMs struggle significantly: runtime gains are negligible, and memory optimizations are nearly non-existent. This stands in sharp contrast to expert implementations, which achieve an aggregate speedup of 15.5x and peak memory reduction of 171.3x over benchmark tasks. Expert-written improvements are observed in 91.2% of tasks for runtime and 65.7% for peak memory. Our findings expose a substantial gap between current LLM capabilities and the demands of expert-level engineering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SWE-Pro, a repository-level benchmark derived from 102 expert-written optimizations collected from open-source projects. Each task is paired with parameterized tests to measure runtime, peak memory, and Time-Weighted Memory Usage (TWMU) under varying inputs and noise-aware conditions. Evaluation of current LLMs shows negligible runtime gains and near-absent memory optimizations, contrasting with expert implementations that deliver 15.5x aggregate speedup, 171.3x peak memory reduction, and improvements in 91.2% (runtime) and 65.7% (memory) of tasks.
Significance. If the benchmark tasks prove representative and the quantitative results hold after full methodological disclosure, the work would demonstrate a clear capability gap between current LLMs and expert-level repository performance optimization. The parameterized tests and explicit noise-aware protocol are positive features that increase realism over prior single-function or single-metric benchmarks. The findings could usefully direct future research on LLM code refinement toward handling trade-offs and measurement variability.
major comments (3)
- [Abstract and §3 (Benchmark Construction)] The selection criteria, domain coverage, and verification process for the 102 expert optimizations are not stated. This is load-bearing for the central claim because the reported LLM-expert gap (negligible vs. 15.5x / 171.3x) cannot be interpreted without evidence that the tasks were not filtered for cases already known to admit large expert gains.
- [Abstract and Evaluation section] The paper reports specific quantitative outcomes (15.5x speedup, 91.2% improvement rate, etc.) but supplies no information on which LLMs were tested, the prompting strategies employed, the exact statistical methods for aggregating results, or how measurement noise was quantified and thresholded. These omissions prevent verification or reproduction of the claim that LLMs achieve only negligible gains.
- [Evaluation section] The description of the noise-aware measurement protocol and parameterized tests lacks quantitative detail on input-parameter ranges, number of repetitions, or the precise definition of 'negligible' gains. Without these, it is impossible to assess whether the expert baselines are robust or whether the LLM results are sensitive to the chosen noise model.
minor comments (1)
- [Abstract] The acronym TWMU is introduced without an explicit expansion on first use.
Simulated Author's Rebuttal
We thank the referee for the constructive comments identifying areas requiring greater methodological transparency. We will revise the manuscript to incorporate the requested details on benchmark construction, LLM evaluation setup, and measurement protocols. This will strengthen the interpretability of the LLM-expert performance gap without altering the core findings.
Point-by-point responses
- Referee: [Abstract and §3 (Benchmark Construction)] The selection criteria, domain coverage, and verification process for the 102 expert optimizations are not stated. This is load-bearing for the central claim because the reported LLM-expert gap (negligible vs. 15.5x / 171.3x) cannot be interpreted without evidence that the tasks were not filtered for cases already known to admit large expert gains.
Authors: We acknowledge the omission in the current draft. In the revised §3, we will add explicit selection criteria (commits with measurable performance impact from open-source repos, diversity across domains like databases, ML pipelines, and web servers), domain coverage breakdown (e.g., 35% data-intensive, 28% compute-bound), and verification process (independent review by two authors plus automated test validation ensuring parameterized tests pass on original and optimized code). This will confirm representative sampling without post-hoc filtering for large gains. Revision: yes.
- Referee: [Abstract and Evaluation section] The paper reports specific quantitative outcomes (15.5x speedup, 91.2% improvement rate, etc.) but supplies no information on which LLMs were tested, the prompting strategies employed, the exact statistical methods for aggregating results, or how measurement noise was quantified and thresholded. These omissions prevent verification or reproduction of the claim that LLMs achieve only negligible gains.
Authors: The manuscript draft lacks these specifics. We will expand the Evaluation section to list the exact LLMs (GPT-4o, Claude-3.5-Sonnet, Llama-3-70B, etc.), prompting strategies (zero-shot with repository context, chain-of-thought, and retrieval-augmented examples), aggregation methods (geometric mean speedups with 95% bootstrap CIs), and noise thresholding (gains <2% after subtracting 1-sigma measurement variance classified as negligible). This enables full reproduction. Revision: yes.
- Referee: [Evaluation section] The description of the noise-aware measurement protocol and parameterized tests lacks quantitative detail on input-parameter ranges, number of repetitions, or the precise definition of 'negligible' gains. Without these, it is impossible to assess whether the expert baselines are robust or whether the LLM results are sensitive to the chosen noise model.
Authors: We agree additional quantitative detail is needed. The revision will specify input-parameter ranges (e.g., array sizes 10^3 to 10^6, concurrency levels 1-32), repetitions (minimum 20 runs per configuration with outlier rejection), and 'negligible' definition (runtime/memory change <5% relative to noise floor, where noise floor is std dev across repeated measurements on identical binaries). We will also add a sensitivity table showing results under alternative noise models. Revision: yes.
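The aggregation and thresholding steps named in the responses above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the percentile-bootstrap variant, the resample count, and the exact way the noise floor is subtracted are all hypothetical. The responses mention both a 2% and a 5% cutoff, so the threshold is left as a parameter rather than fixed.

```python
import random
from statistics import geometric_mean, mean, stdev


def bootstrap_ci(speedups, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the geometric-mean speedup."""
    rng = random.Random(seed)
    stats = sorted(
        geometric_mean(rng.choices(speedups, k=len(speedups)))
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def is_negligible(baseline_runs, candidate_runs, threshold=0.05):
    """True if the relative gain, minus the noise floor, is below threshold.

    ASSUMPTION: the noise floor is the standard deviation of repeated
    baseline measurements, expressed relative to the baseline mean.
    """
    base = mean(baseline_runs)
    gain = (base - mean(candidate_runs)) / base
    noise_floor = stdev(baseline_runs) / base
    return gain - noise_floor < threshold


if __name__ == "__main__":
    # Hypothetical per-task speedups (baseline time / optimized time).
    speedups = [1.02, 0.98, 1.10, 1.00, 3.50, 1.05]
    print("geometric mean:", geometric_mean(speedups))
    print("95% CI:", bootstrap_ci(speedups))
    print("negligible:", is_negligible([10.0, 10.1, 9.9], [9.95, 10.0, 9.9]))
```

A geometric mean suits ratios such as speedups because a 2x gain and a 2x regression cancel out, which an arithmetic mean would not reflect.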
Circularity Check
Empirical benchmark comparison with no derivational chain
Full rationale
The paper introduces SWE-Pro as a new repository-level benchmark constructed from 102 expert optimizations and evaluates LLMs against expert baselines using parameterized tests and noise-aware measurements. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text. The central claims are direct empirical observations (LLM gains negligible vs. expert 15.5x/171.3x) on the collected tasks; representativeness is an external validity concern, not a circular reduction of the reported results to their own construction. This matches the default case of a self-contained empirical study.
Axiom & Free-Parameter Ledger
free parameters (1)
- Task selection criteria for the 102 optimizations.
axioms (1)
- Domain assumption: Parameterized tests under noise-aware conditions accurately capture real-world performance variability and trade-offs.