An audit of GSO, SWE-Perf and SWE-fficiency finds that reference patches satisfy the validity rules across machines for only 39/102, 11/140 and 411/498 tasks, respectively; that public submissions beat the references on 85.3% of replay-valid tasks; and that scoring rules cause ranking disagreements.
How do agents perform code optimization? An empirical study
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.SE (2)
Years: 2026 (2)
Representative citing papers
JETO-Mine is a reusable three-phase pipeline that mines 1.8 million Java commits to produce JETO-Bench, a benchmark of 91 verified, executable execution time improvement patches (ETIPs), on which OpenHands achieves a 14.3% success rate.
JETO-Bench: A Reproducible Benchmark for Execution Time Improvement Patches in Java