CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Recent progress in automated repair of performance bugs demands realistic, executable benchmarks. However, existing C++ performance benchmarks are largely built from competitive programming submissions, and recent real-world benchmarks predominantly target Python and .NET. To fill this gap, we present CppPerf-Mine, a configurable pipeline that mines execution-time-improving patches from open-source C++ repositories on GitHub by combining structural commit filtering, an LLM-based commit classifier, and a containerized build & test stage that produces fully reproducible Docker images for each patch. Using CppPerf-Mine, we build CppPerf-DB, a benchmark comprising 347 manually verified patches from 42 mature C++ repositories, 39% of which are multi-file, enabling the evaluation of repository-level repair tools. In our preliminary study, OpenHands correctly fixes only 13.5% of the patches in CppPerf-DB, confirming that real-world C++ performance repair remains an open challenge. CppPerf-Mine and CppPerf-DB are open-source and publicly available at: https://doi.org/10.5281/zenodo.20097425. In addition, a demonstration video is available at: https://www.youtube.com/watch?v=nixlupIgSdM.

fields

cs.SE 1

years

2026 1

verdicts

UNVERDICTED 1

representative citing papers

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

cs.SE · 2026-07-01 · unverdicted · novelty 6.0

Audit of GSO, SWE-Perf and SWE-fficiency reveals that reference patches satisfy validity rules across machines for only 39/102, 11/140 and 411/498 tasks respectively, public submissions beat references on 85.3% of replay-valid tasks, and scoring rules cause ranking disagreements.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents? cs.SE · 2026-07-01 · unverdicted · none · ref 51 · internal anchor