The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.SE 3representative citing papers
Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.
TestPrune minimizes regression test suites to improve bug reproduction and patch validation in LLM-based agentic repair pipelines, delivering 6-13% relative gains on SWE-Bench benchmarks at low API cost.
citing papers explorer
-
Investigating Test Overfitting on SWE-bench
The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.
-
Reproduction Test Generation for Java SWE Issues
Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.
-
Can Old Tests Do New Tricks for Resolving SWE Issues?
TestPrune minimizes regression test suites to improve bug reproduction and patch validation in LLM-based agentic repair pipelines, delivering 6-13% relative gains on SWE-Bench benchmarks at low API cost.