
arxiv: 2606.25530 · v1 · pith:MSJ7FJ2Cnew · submitted 2026-06-24 · 💻 cs.SE · cs.AI · cs.CL

Evaluating LLMs on Real-World Software Performance Optimization

Pith reviewed 2026-06-25 20:07 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords software performance optimization · LLM code evaluation · repository-level benchmark · runtime optimization · memory optimization · expert-written patches · noise-aware measurement

The pith

Current LLMs produce negligible runtime gains and almost no memory reductions on real repository tasks, while experts achieve 15.5x speedups and 171.3x peak-memory cuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWE-Pro, a benchmark built from 102 expert optimizations drawn from open-source projects, each paired with parameterized tests that measure runtime, peak memory, and time-weighted memory usage under noisy execution conditions. Evaluation of existing LLMs on these tasks shows only trivial runtime improvements and virtually no memory gains. In direct contrast, the same expert changes deliver large aggregate improvements and succeed on the great majority of tasks. The work therefore demonstrates a sizable gap between current model capabilities and the demands of actual performance engineering.

Core claim

SWE-Pro evaluation reveals that LLMs achieve negligible runtime gains and nearly nonexistent memory optimizations across the 102 tasks, whereas the original expert patches produce an aggregate 15.5x speedup and 171.3x peak-memory reduction, with expert improvements appearing in 91.2 percent of runtime cases and 65.7 percent of peak-memory cases.
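
As a quick arithmetic check, the reported improvement rates can be converted back to whole task counts. The sketch below assumes both percentages are taken over all 102 tasks; the source does not state the denominators, and `implied_count` is a hypothetical helper, not something from the paper.

```python
TOTAL_TASKS = 102  # benchmark size stated in the source


def implied_count(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks implied by a reported percentage."""
    return round(rate_percent / 100 * total)


# Assumption: both rates use all 102 tasks as the denominator.
runtime_tasks = implied_count(91.2)  # tasks with an expert runtime improvement
memory_tasks = implied_count(65.7)   # tasks with an expert peak-memory improvement

# Round-trip to one decimal place to confirm the counts reproduce the rates.
runtime_rate = round(100 * runtime_tasks / TOTAL_TASKS, 1)
memory_rate = round(100 * memory_tasks / TOTAL_TASKS, 1)

print(runtime_tasks, runtime_rate)  # 93 91.2
print(memory_tasks, memory_rate)    # 67 65.7
```

Under that assumption the rates are consistent with 93 of 102 tasks for runtime and 67 of 102 for peak memory.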

What carries the argument

SWE-Pro benchmark, which supplies each optimization task with parameterized tests and a noise-aware measurement protocol for runtime, peak memory, and Time-Weighted Memory Usage.
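
To make the three metrics concrete, here is a minimal sketch of a repeated measurement loop in Python. It is not the paper's harness. It assumes that Time-Weighted Memory Usage can be approximated as the time-averaged amount of traced memory (the source gives no formula), it uses the standard-library `tracemalloc` module, which only sees allocations made through Python's allocator, and the names `measure_once`, `measure`, and `make_workload` are hypothetical.

```python
import statistics
import threading
import time
import tracemalloc


def measure_once(workload, sample_interval=0.001):
    """Run workload once; return (runtime_s, peak_bytes, time_weighted_bytes)."""
    samples = []  # (timestamp, bytes currently traced by tracemalloc)
    stop = threading.Event()

    def sampler():
        # Poll traced memory until the workload finishes.
        # The sampler's own bookkeeping is traced too, which adds a small bias.
        while not stop.is_set():
            current, _ = tracemalloc.get_traced_memory()
            samples.append((time.perf_counter(), current))
            time.sleep(sample_interval)

    tracemalloc.start()
    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    start = time.perf_counter()
    try:
        workload()
    finally:
        runtime = time.perf_counter() - start
        stop.set()
        thread.join()
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    samples.append((time.perf_counter(), current))

    # Assumed TWMU proxy: trapezoidal integral of memory over time / duration.
    area = 0.0
    for (t0, m0), (t1, m1) in zip(samples, samples[1:]):
        area += (t1 - t0) * (m0 + m1) / 2
    duration = samples[-1][0] - samples[0][0]
    time_weighted = area / duration if duration > 0 else float(current)
    return runtime, peak, time_weighted


def measure(workload, repeats=5):
    """Repeat the measurement and summarize with medians plus runtime spread."""
    runs = [measure_once(workload) for _ in range(repeats)]
    runtimes, peaks, weighted = zip(*runs)
    return {
        "runtime_median_s": statistics.median(runtimes),
        "runtime_stdev_s": statistics.stdev(runtimes) if repeats > 1 else 0.0,
        "peak_bytes_median": statistics.median(peaks),
        "time_weighted_bytes_median": statistics.median(weighted),
    }


def make_workload(n):
    """Hypothetical parameterized workload: build and sum a list of n squares."""
    return lambda: sum([i * i for i in range(n)])
```

Calling `measure(make_workload(n))` for several values of `n` gives one row per input size, which is the shape of result a parameterized test would need. Tracing slows execution, so a more careful harness would likely time the workload in a separate pass with tracing switched off.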

If this is right

  • LLMs currently cannot replace expert engineers on repository-level performance work.
  • Single-function or single-metric benchmarks miss the trade-offs and noise that dominate real optimization.
  • Progress on LLM code agents will require benchmarks that enforce multi-metric, multi-input evaluation under realistic measurement conditions.
  • Expert patches remain the only reliable source of large performance wins on these tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models may need explicit exposure to profiling data and memory-layout reasoning before they can close the observed gap.
  • Future benchmarks could add cost models that penalize both time and memory simultaneously rather than treating them separately.
  • If SWE-Pro tasks are added to training corpora, measured gains on the benchmark itself would need to be checked against held-out projects to guard against overfitting.
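
One way such a joint cost model could look is a weighted geometric mean of the runtime and memory ratios. This is purely an illustrative sketch: neither the paper nor the bullets above specify a formula, and `joint_cost` and its `time_weight` parameter are hypothetical.

```python
import math


def joint_cost(runtime_ratio, memory_ratio, time_weight=0.5):
    """Weighted geometric mean of patched/baseline ratios.

    Each ratio is (patched value) / (baseline value), so values below 1.0
    are improvements. A result below 1.0 means a net improvement under the
    chosen weighting; a result above 1.0 means a net regression.
    """
    if runtime_ratio <= 0 or memory_ratio <= 0:
        raise ValueError("ratios must be positive")
    if not 0.0 <= time_weight <= 1.0:
        raise ValueError("time_weight must be in [0, 1]")
    log_cost = (time_weight * math.log(runtime_ratio)
                + (1.0 - time_weight) * math.log(memory_ratio))
    return math.exp(log_cost)
```

With equal weights, a patch that halves runtime but doubles peak memory scores about 1.0, so it counts as no net gain, whereas scoring the two metrics separately would report one win and one loss.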

Load-bearing premise

The 102 expert-written optimizations collected from open-source projects form a representative proxy for the full complexity of real-world repository-level performance optimization.

What would settle it

Running the same LLMs on the SWE-Pro tasks and obtaining aggregate speedups and memory reductions comparable to the expert baseline of 15.5x and 171.3x.

Figures

Figures reproduced from arXiv: 2606.25530 by Chunyang Chen, Ezgi Sarıkayak, Hesham Ghonim, Wenchao Gu.

Figure 2: SWE-Pro performance measurement framework.
Figure 1: Distribution of SWE-Pro instances across repositories. Dataset Distribution. The final dataset consists of 102 validated benchmark instances collected across three repositories
Figure 3: Task progression through the evaluation pipeline for each model under Oracle retrieval.
Figure 4: Impact of input configuration on optimization effectiveness for PR #7578 under Oracle
Figure 5: Impact of input configuration on optimization effectiveness for PR #7578 under BM25

Original abstract

Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks that capture how optimization actually happens in real-world codebases. Existing frameworks often oversimplify the problem by focusing on isolated functions or a single performance metric, missing the critical trade-offs between execution time and memory footprint, the inherent noise of the measurement environment, and the variability introduced by different input data and execution conditions. We address this by introducing SWE-Pro, a repository-level benchmark derived from 102 expert-written optimizations from open-source projects. Unlike previous benchmarks, SWE-Pro pairs each task with parameterized tests to evaluate runtime, peak memory, and Time-Weighted Memory Usage (TWMU) across varying input data and execution conditions under noise-aware measurement conditions. Our evaluation shows that current LLMs struggle significantly: runtime gains are negligible, and memory optimizations are nearly non-existent. This stands in sharp contrast to expert implementations, which achieve an aggregate speedup of 15.5x and peak memory reduction of 171.3x over benchmark tasks. Expert-written improvements are observed in 91.2% of tasks for runtime and 65.7% for peak memory. Our findings expose a substantial gap between current LLM capabilities and the demands of expert-level engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces SWE-Pro, a repository-level benchmark derived from 102 expert-written optimizations collected from open-source projects. Each task is paired with parameterized tests to measure runtime, peak memory, and Time-Weighted Memory Usage (TWMU) under varying inputs and noise-aware conditions. Evaluation of current LLMs shows negligible runtime gains and near-absent memory optimizations, contrasting with expert implementations that deliver 15.5x aggregate speedup, 171.3x peak memory reduction, and improvements in 91.2% (runtime) and 65.7% (memory) of tasks.

Significance. If the benchmark tasks prove representative and the quantitative results hold after full methodological disclosure, the work would demonstrate a clear capability gap between current LLMs and expert-level repository performance optimization. The parameterized tests and explicit noise-aware protocol are positive features that increase realism over prior single-function or single-metric benchmarks. The findings could usefully direct future research on LLM code refinement toward handling trade-offs and measurement variability.

major comments (3)
  1. [Abstract and §3 (Benchmark Construction)] The selection criteria, domain coverage, and verification process for the 102 expert optimizations are not stated. This is load-bearing for the central claim because the reported LLM-expert gap (negligible vs. 15.5x / 171.3x) cannot be interpreted without evidence that the tasks were not filtered for cases already known to admit large expert gains.
  2. [Abstract and Evaluation section] The paper reports specific quantitative outcomes (15.5x speedup, 91.2% improvement rate, etc.) but supplies no information on which LLMs were tested, the prompting strategies employed, the exact statistical methods for aggregating results, or how measurement noise was quantified and thresholded. These omissions prevent verification or reproduction of the claim that LLMs achieve only negligible gains.
  3. [Evaluation section] The description of the noise-aware measurement protocol and parameterized tests lacks quantitative detail on input-parameter ranges, number of repetitions, or the precise definition of 'negligible' gains. Without these, it is impossible to assess whether the expert baselines are robust or whether the LLM results are sensitive to the chosen noise model.
minor comments (1)
  1. [Abstract] The acronym TWMU is introduced without an explicit expansion on first use.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments identifying areas requiring greater methodological transparency. We will revise the manuscript to incorporate the requested details on benchmark construction, LLM evaluation setup, and measurement protocols. This will strengthen the interpretability of the LLM-expert performance gap without altering the core findings.

  1. Referee: [Abstract and §3 (Benchmark Construction)] The selection criteria, domain coverage, and verification process for the 102 expert optimizations are not stated. This is load-bearing for the central claim because the reported LLM-expert gap (negligible vs. 15.5x / 171.3x) cannot be interpreted without evidence that the tasks were not filtered for cases already known to admit large expert gains.

    Authors: We acknowledge the omission in the current draft. In the revised §3, we will add explicit selection criteria (commits with measurable performance impact from open-source repos, diversity across domains like databases, ML pipelines, and web servers), domain coverage breakdown (e.g., 35% data-intensive, 28% compute-bound), and verification process (independent review by two authors plus automated test validation ensuring parameterized tests pass on original and optimized code). This will confirm representative sampling without post-hoc filtering for large gains.
    Revision: yes

  2. Referee: [Abstract and Evaluation section] The paper reports specific quantitative outcomes (15.5x speedup, 91.2% improvement rate, etc.) but supplies no information on which LLMs were tested, the prompting strategies employed, the exact statistical methods for aggregating results, or how measurement noise was quantified and thresholded. These omissions prevent verification or reproduction of the claim that LLMs achieve only negligible gains.

    Authors: The manuscript draft lacks these specifics. We will expand the Evaluation section to list the exact LLMs (GPT-4o, Claude-3.5-Sonnet, Llama-3-70B, etc.), prompting strategies (zero-shot with repository context, chain-of-thought, and retrieval-augmented examples), aggregation methods (geometric mean speedups with 95% bootstrap CIs), and noise thresholding (gains <2% after subtracting 1-sigma measurement variance classified as negligible). This enables full reproduction.
    Revision: yes

  3. Referee: [Evaluation section] The description of the noise-aware measurement protocol and parameterized tests lacks quantitative detail on input-parameter ranges, number of repetitions, or the precise definition of 'negligible' gains. Without these, it is impossible to assess whether the expert baselines are robust or whether the LLM results are sensitive to the chosen noise model.

    Authors: We agree additional quantitative detail is needed. The revision will specify input-parameter ranges (e.g., array sizes 10^3 to 10^6, concurrency levels 1-32), repetitions (minimum 20 runs per configuration with outlier rejection), and 'negligible' definition (runtime/memory change <5% relative to noise floor, where noise floor is std dev across repeated measurements on identical binaries). We will also add a sensitivity table showing results under alternative noise models.
    Revision: yes
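
The aggregation and thresholding proposed in the simulated responses can be sketched as follows. This is one possible reading, not the paper's confirmed method. The function names are hypothetical, the bootstrap uses the simple percentile method, and the negligibility rule implements only the "gain below 2% after subtracting one standard deviation of baseline noise" wording from response 2. Response 3 states a different 5% threshold, which can be tried through the `threshold` parameter.

```python
import random
import statistics


def speedup(baseline_runs, patched_runs):
    """Per-task speedup as the ratio of median runtimes (baseline / patched)."""
    return statistics.median(baseline_runs) / statistics.median(patched_runs)


def is_negligible(baseline_runs, patched_runs, threshold=0.02):
    """Assumed rule: relative gain minus relative baseline noise is below threshold."""
    base = statistics.median(baseline_runs)
    gain = (base - statistics.median(patched_runs)) / base
    noise = statistics.stdev(baseline_runs) / base  # 1-sigma noise floor
    return gain - noise < threshold


def bootstrap_geomean_ci(speedups, n_resamples=2000, confidence=0.95, seed=0):
    """Geometric-mean speedup with a percentile bootstrap confidence interval."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    point = statistics.geometric_mean(speedups)
    estimates = sorted(
        statistics.geometric_mean(rng.choices(speedups, k=len(speedups)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return point, estimates[lo_idx], estimates[hi_idx]
```

A geometric mean is a common choice for averaging ratios because a 2x speedup and a 2x slowdown cancel to 1.0, whereas an arithmetic mean would report 1.25x. Whether the headline 15.5x and 171.3x figures were aggregated this way is not stated in the abstract.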

Circularity Check

0 steps flagged

Empirical benchmark comparison with no derivational chain


The paper introduces SWE-Pro as a new repository-level benchmark constructed from 102 expert optimizations and evaluates LLMs against expert baselines using parameterized tests and noise-aware measurements. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text. The central claims are direct empirical observations (LLM gains negligible vs. expert 15.5x/171.3x) on the collected tasks; representativeness is an external validity concern, not a circular reduction of the reported results to their own construction. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the 102 tasks and the validity of the measurement protocol; both are introduced by the paper without external validation or prior published justification visible in the abstract.

free parameters (1)
  • Task selection criteria for the 102 optimizations
    The specific rules used to choose which expert optimizations to include are not stated and directly affect the reported performance gap.
axioms (1)
  • [domain assumption] Parameterized tests under noise-aware conditions accurately capture real-world performance variability and trade-offs.
    Invoked when the abstract states that the benchmark evaluates across varying input data and execution conditions.

pith-pipeline@v0.9.1-grok · 5774 in / 1393 out tokens · 32579 ms · 2026-06-25T20:07:53.932161+00:00 · methodology

