Investigating Test Overfitting on SWE-bench
Pith reviewed 2026-05-17 19:53 UTC · model grok-4.3
The pith
Systems resolving software issues overfit to imperfect auto-generated tests on SWE-bench.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Test overfitting arises when resolution systems iteratively refine code against tests that were themselves derived from issue descriptions; the paper shows through experiments on SWE-bench that this process frequently yields patches passing the generated tests without addressing the underlying defect or preserving unrelated functionality.
What carries the argument
Measurement of how often generated tests accept incorrect or incomplete patches on SWE-bench tasks.
If this is right
- Patches accepted by current systems may pass generated tests yet still contain bugs or miss the intended fix.
- Iterative joint refinement of code and tests can lock in the overfitting.
- Benchmark scores on SWE-bench can overstate real-world resolution quality.
- Resolution pipelines need additional checks beyond the initial generated test set.
Where Pith is reading between the lines
- Benchmarks would be more reliable if they supplied separate verification tests not derived from the issue description.
- Systems could reduce overfitting by adding static analysis or execution on diverse inputs after the generated tests pass.
- The same risk likely exists in other AI coding settings that start from incomplete natural-language specifications.
Load-bearing premise
The assumption that tests auto-generated from issue text are imperfect enough to let overfitting produce solutions that are meaningfully wrong rather than just slightly incomplete.
What would settle it
A held-out test suite written independently of the issue text on which most accepted patches from current systems fail.
Figures
read the original abstract
Tests can be useful towards resolving issues on code repositories. However, relying too much on tests for issue resolution can lead to code that technically passes observed tests but actually misses important cases or even breaks functionality. This problem, called test overfitting, is exacerbated by the fact that issues usually lack readily executable tests. Instead, several issue resolution systems use tests auto-generated from issues, which may be imperfect. Some systems even iteratively refine code and tests jointly. This paper presents the first empirical study of test overfitting in this setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first empirical study of test overfitting in issue-resolution systems evaluated on SWE-bench. It observes that issues typically lack executable tests, that systems therefore rely on auto-generated tests which may be imperfect, and that some systems jointly refine code and tests; the study aims to measure how often patches pass the observed (generated) tests yet fail to address the intended behavior.
Significance. If the empirical measurements are robust, the work would be useful for the SWE-bench and automated program repair communities by highlighting a possible mismatch between benchmark success and actual correctness. It could motivate the inclusion of held-out oracles or differential testing in future evaluations.
major comments (2)
- [Methodology / Experimental Setup] The central measurement of test overfitting requires an independent way to determine whether a patch that passes the auto-generated tests is actually correct. The manuscript does not describe any held-out test suite, manual correctness labels, or coverage-based differential analysis that would separate overfitting from under-specified tests; without such an oracle the reported rates of overfitting rest on an unverified premise (see skeptic note and abstract).
- [Related Work / Introduction] The claim that the study is the 'first empirical study' of test overfitting in this setting is load-bearing for the contribution. The paper should explicitly compare its methodology and findings against prior work on test overfitting in APR or on SWE-bench variants that already use generated tests, to substantiate novelty.
minor comments (2)
- [Definitions] Clarify the exact definition of 'test overfitting' used in the quantitative results (e.g., is it patch passes generated tests but fails on a human-written test, or fails on a manually inspected intended behavior?).
- [Experimental Setup] Provide the full list of SWE-bench tasks, models, and test-generation methods examined so that the scope of the empirical study is reproducible.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have revised the paper to strengthen the presentation of our methodology and to better substantiate the novelty of the work.
read point-by-point responses
-
Referee: [Methodology / Experimental Setup] The central measurement of test overfitting requires an independent way to determine whether a patch that passes the auto-generated tests is actually correct. The manuscript does not describe any held-out test suite, manual correctness labels, or coverage-based differential analysis that would separate overfitting from under-specified tests; without such an oracle the reported rates of overfitting rest on an unverified premise (see skeptic note and abstract).
Authors: We agree that an explicit account of how actual correctness is assessed is necessary to support the reported overfitting rates. Our evaluation uses the SWE-bench test harness (which executes the repository's own test suites) as the primary indicator of whether a patch resolves the underlying issue beyond the auto-generated tests. In addition, we conducted coverage-based differential analysis and manual inspection on a representative sample of patches to distinguish true fixes from overfitting cases. We will revise the Methodology section to include a dedicated subsection that fully describes this oracle process, the sampling procedure, and how we mitigate the risk of conflating under-specified tests with overfitting. This change will make the empirical premise transparent and address the concern directly. revision: yes
-
Referee: [Related Work / Introduction] The claim that the study is the 'first empirical study' of test overfitting in this setting is load-bearing for the contribution. The paper should explicitly compare its methodology and findings against prior work on test overfitting in APR or on SWE-bench variants that already use generated tests, to substantiate novelty.
Authors: We appreciate the referee's point that the novelty claim requires explicit positioning. While our work is the first to focus specifically on test overfitting arising from auto-generated tests in LLM-based issue-resolution systems evaluated on SWE-bench, we acknowledge related investigations of overfitting in classical APR. We will expand the Related Work and Introduction sections to include a systematic comparison with prior APR studies on test overfitting (e.g., those examining overfitting in search-based and semantic repair) and with any SWE-bench variants that rely on generated tests. The revised text will highlight the methodological distinctions—our emphasis on natural-language issue descriptions, joint code-and-test refinement, and large-scale empirical measurement across multiple systems—to substantiate why the present study constitutes the first targeted investigation in this setting. revision: yes
Circularity Check
Empirical analysis on external benchmark with no self-referential derivations
full rationale
This is an empirical study that applies existing methods to the external SWE-bench benchmark and reports observations about test overfitting. No equations, fitted parameters, or first-principles derivations are present in the provided abstract or description. The central claims rest on measurements against the benchmark rather than any quantity defined in terms of itself or justified solely by self-citation chains. The paper is therefore self-contained against external data with no reduction of results to its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Issues usually lack readily executable tests.
- domain assumption Tests auto-generated from issues may be imperfect.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
This paper presents the first empirical study of test overfitting in this setting... test-based code refinement loop... Reward(c_old, c_new, t_gen) = 1/3 isFail + 1/3 isPass + 1/3 coverage
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2]
- [3]
-
[4]
Y.; Madry, A.; Zaremba, W.; Pachocki, J.; and Farhi, D
Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. 2025. Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. https://arxiv.org/abs/2503.11926
-
[5]
Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2023. CodeT: Code Generation with Generated Tests. In International Conference on Learning Representations (ICLR)
work page 2023
-
[6]
Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, Jie Wang, Xiao Cheng, Guangtai Liang, Yuchi Ma, Pan Bian, Tao Xie, and Qianxiang Wang. 2024. CodeR: Issue Resolving with Multi-Agent and Task Graphs. https://arxiv.org/abs/2406. 01304
work page 2024
-
[7]
Jimenez, John Yang, Kevin Liu, and Aleksander Madry
Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Kevin Liu, and Aleksander Madry. 2024. Introducing SWE- bench Verified. https://openai.com/index/introducing-swe-bench-verified/
work page 2024
- [8]
-
[9]
Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, Yun Lin, Yingfei Xiong, Chao Peng, and Xia Liu. 2025. Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling. https://arxiv.org/abs/2507.23370
- [10]
-
[11]
Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real- World GitHub Issues?. InInternational Conference on Learning Representations (ICLR)
work page 2024
-
[12]
Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. 2025. S*: Test Time Scaling for Code Generation. https://arxiv.org/abs/2502.14382
-
[13]
Hongwei Li, Yuheng Tang, Shiqi Wang, and Wenbo Guo. 2025. PatchPilot: A Stable and Cost-Efficient Agentic Patching Framework. InInternational Conference on Machine Learning (ICML)
work page 2025
- [14]
-
[15]
Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. 2024. SWT- Bench: Testing and Validating Real-World Bug-Fixes with Code Agents. InCon- ference on Neural Information Processing Systems (NeurIPS)
work page 2024
- [16]
-
[17]
Edward K. Smith, Earl T. Barr, Claire Le Goues, and Yuriy Brun. 2015. Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair. In Symposium on the Foundations of Software Engineering (FSE). https://doi.org/10. 1145/2786805.2786825
- [18]
- [19]
-
[20]
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. De- mystifying LLM-based Software Engineering Agents. InSymposium on the Foun- dations of Software Engineering (FSE). 801–824. https://doi.org/10.1145/3715754
-
[21]
Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-Agent: Agent-computer Interfaces Enable Automated Software Engineering. InConference on Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/ paper_files/paper/2024/hash/5a7c947568c1b1328ccc5230172e1e7c-Abstra...
work page 2024
-
[22]
Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh Murthy, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu, Bo Pang, Yingbo Zhou, Shelby Heinecke, Silvio Savarese, Huan Wang, and Caiming Xiong. 2025. Diversity Empowers In- telligence: Integrating Expertise of Software Engineering Agents. InInternational Conference on Learning Representations (ICLR)
work page 2025
- [23]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.