arxiv: 2507.12415 · v2 · pith:CI2ZWZYYnew · submitted 2025-07-16 · 💻 cs.SE

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

keywords: code, performance, llms, optimization, swe-perf, benchmark, critical, language

Code performance optimization is paramount in real-world software engineering and critical for production-level systems. While Large Language Models (LLMs) have demonstrated impressive capabilities in code generation and bug fixing, their proficiency in enhancing code performance at the repository level remains largely unexplored. To address this gap, we introduce SWE-Perf, the first benchmark specifically designed to systematically evaluate LLMs on code performance optimization tasks within authentic repository contexts. SWE-Perf comprises 140 carefully curated instances, each derived from performance-improving pull requests from popular GitHub repositories. Each benchmark instance includes the relevant codebase, target functions, performance-related tests, expert-authored patches, and executable environments. Through a comprehensive evaluation of representative methods that span file-level and repo-level approaches (e.g., Agentless and OpenHands), we reveal a substantial capability gap between existing LLMs and expert-level optimization performance, highlighting critical research opportunities in this emerging field.
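
As a reading aid, the instance contents listed in the abstract (codebase, target functions, performance-related tests, expert-authored patch, executable environment) could be pictured as a small record type. The sketch below is illustrative only: the class name `PerfInstance`, every field name, and the `relative_gain` helper are assumptions made here, not the benchmark's actual schema or scoring metric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PerfInstance:
    """Hypothetical record for one benchmark instance (all field names are assumptions)."""
    repo: str                    # repository the performance-improving pull request came from
    target_functions: List[str]  # functions the optimization is aimed at
    perf_tests: List[str]        # performance-related tests used to measure runtime
    expert_patch: str            # expert-authored patch, e.g. as a unified diff
    environment: str             # handle for the executable environment, e.g. an image tag


def relative_gain(baseline_seconds: float, patched_seconds: float) -> float:
    """Fraction of the baseline runtime removed by a patch (illustrative metric only)."""
    if baseline_seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    return (baseline_seconds - patched_seconds) / baseline_seconds
```

Under these assumptions, comparing `relative_gain` for a model-written patch against the same quantity for the expert patch gives one simple way to picture the capability gap the abstract reports.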

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  2. SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

    cs.SE 2026-05 unverdicted novelty 7.0

    SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.

  3. BootstrapAgent: Distilling Repository Setup into Reusable Agent Knowledge

    cs.SE 2026-05 unverdicted novelty 7.0

    BootstrapAgent distills repository bootstrapping heuristics into a persistent .bootstrap contract via multi-agent evidence extraction, Docker verification, and trace-driven repair, reporting 92.9% success and efficien...

  4. Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

    cs.AR 2026-05 unverdicted novelty 7.0

    Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting perfo...

  5. CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits

    cs.SE 2026-05 accept novelty 7.0

    CppPerf-Mine produces CppPerf-DB, a benchmark of 347 real-world performance-improving C++ patches (39% multi-file) from 42 repositories to evaluate repository-level repair tools.

  6. PlayCoder: Making LLM-Generated GUI Code Playable

    cs.SE 2026-04 conditional novelty 7.0

    PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.

  7. JETO-Bench: A Reproducible Benchmark for Execution Time Improvement Patches in Java

    cs.SE 2026-06 conditional novelty 6.0

    JETO-Mine is a reusable three-phase pipeline that mines 1.8 million Java commits to produce JETO-Bench containing 91 verified executable ETIPs, on which OpenHands succeeds at 14.3%.

  8. Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search

    cs.LO 2026-05 unverdicted novelty 6.0

    Lean Refactor uses retrieval from a curated multi-objective strategy database to guide frozen LLMs in refactoring Lean proofs, reporting over 70% token compression on benchmarks and improved version transfer.

  9. SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

    cs.LG 2026-05 unverdicted novelty 6.0

    SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.

  10. Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution

    cs.SE 2026-04 unverdicted novelty 6.0

    LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.