arXiv: 2511.16858 · v3 · submitted 2025-11-20 · cs.SE · cs.LG

Investigating Test Overfitting on SWE-bench

Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, Martin Hirzel

Pith reviewed 2026-05-17 19:53 UTC · model grok-4.3

keywords: test overfitting, SWE-bench, issue resolution, code repair, auto-generated tests, empirical study, software engineering

The pith

Systems resolving software issues overfit to imperfect auto-generated tests on SWE-bench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper conducts the first empirical study of test overfitting in issue resolution systems that rely on tests auto-generated from natural language issue descriptions. These systems often lack access to ready-made executable tests, so they create their own from the issue text and then use those tests to guide code changes. When the generated tests are incomplete or miss key cases, the resulting fixes can pass the tests while leaving the original problem unsolved or creating new failures. The study examines this dynamic on the SWE-bench benchmark to quantify how often overfitting occurs. Readers should care because many current AI-driven repair tools follow this pattern and may therefore report inflated success rates.
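To make the failure mode concrete, here is a toy sketch in Python. It is not taken from the paper: the `clamp` function, both candidate patches, and both test functions are hypothetical, and the "issue" is assumed to mention only values above the upper bound.

```python
# Toy illustration of test overfitting (all names are hypothetical).

def generated_test(clamp):
    # Stand-in for a test auto-generated from the issue text, which is
    # assumed to mention only values above the upper bound.
    return clamp(15, 0, 10) == 10

def held_out_test(clamp):
    # Stand-in for an independent check that also covers the lower bound
    # and the in-range case.
    return (
        clamp(15, 0, 10) == 10
        and clamp(-5, 0, 10) == 0
        and clamp(5, 0, 10) == 5
    )

def overfitted_clamp(x, lo, hi):
    # Handles only the case the generated test exercises.
    return hi if x > hi else x

def correct_clamp(x, lo, hi):
    # Handles both bounds.
    return max(lo, min(x, hi))

for name, patch in [("overfitted", overfitted_clamp), ("correct", correct_clamp)]:
    print(name, "generated:", generated_test(patch), "held-out:", held_out_test(patch))
```

Both patches pass the generated test, but only `correct_clamp` passes the broader check, which is the gap the summary above describes.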

Core claim

Test overfitting arises when resolution systems iteratively refine code against tests that were themselves derived from issue descriptions; the paper shows through experiments on SWE-bench that this process frequently yields patches passing the generated tests without addressing the underlying defect or preserving unrelated functionality.

What carries the argument

Measurement of how often generated tests accept incorrect or incomplete patches on SWE-bench tasks.
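As a rough sketch of what such a measurement could look like, the snippet below computes the share of accepted patches (those passing the generated tests) that fail an independent check. This assumes an independent held-out check exists; the `PatchResult` type, its field names, and the sample records are invented for illustration and are not the paper's data or its exact definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PatchResult:
    # Hypothetical record for one candidate patch.
    task_id: str
    passes_generated: bool  # passed the auto-generated tests
    passes_held_out: bool   # passed an independent held-out check

def overfitting_rate(results):
    """Fraction of generated-test-passing patches that fail the held-out check."""
    accepted = [r for r in results if r.passes_generated]
    if not accepted:
        return 0.0
    overfitted = [r for r in accepted if not r.passes_held_out]
    return len(overfitted) / len(accepted)

# Made-up sample data, not results from the paper.
results = [
    PatchResult("task-a", True, True),
    PatchResult("task-b", True, False),
    PatchResult("task-c", False, False),
    PatchResult("task-d", True, True),
]
rate = overfitting_rate(results)
print(f"{rate:.2f}")
```

Patches that fail the generated tests are excluded from the denominator here, since they would not have been accepted in the first place; other denominators are possible and would give different rates.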

If this is right

Patches accepted by current systems may pass generated tests yet still contain bugs or miss the intended fix.
Iterative joint refinement of code and tests can lock in the overfitting.
Benchmark scores on SWE-bench can overstate real-world resolution quality.
Resolution pipelines need additional checks beyond the initial generated test set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Benchmarks would be more reliable if they supplied separate verification tests not derived from the issue description.
Systems could reduce overfitting by adding static analysis or execution on diverse inputs after the generated tests pass.
The same risk likely exists in other AI coding settings that start from incomplete natural-language specifications.
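The second suggestion above, execution on diverse inputs, can be sketched as a small randomized invariant check. Everything here is a hypothetical illustration rather than anything described in the paper: the `dedupe` task, the two candidate patches, and the invariants are assumptions chosen to keep the example short.

```python
import random

def dedupe_adjacent_only(items):
    # Hypothetical overfitted patch: removes only consecutive duplicates,
    # which is enough for a generated test such as [1, 1, 2] -> [1, 2].
    out = []
    for item in items:
        if not out or out[-1] != item:
            out.append(item)
    return out

def dedupe_all(items):
    # Hypothetical intended fix: removes every repeated element, keeping order.
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

def find_counterexample(dedupe, trials=200, seed=0):
    """Return a random input that violates the invariants, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        items = [rng.randint(0, 3) for _ in range(rng.randint(0, 8))]
        result = dedupe(items)
        no_duplicates = len(result) == len(set(result))
        same_members = set(result) == set(items)
        if not (no_duplicates and same_members):
            return items
    return None

print(find_counterexample(dedupe_adjacent_only))
print(find_counterexample(dedupe_all))
```

A check like this needs only properties of the intended behavior, not a reference implementation, so it could run after the generated tests pass without access to hidden tests. It finds violations only for properties someone thought to state.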

Load-bearing premise

The assumption that tests auto-generated from issue text are imperfect enough to let overfitting produce solutions that are meaningfully wrong rather than just slightly incomplete.

What would settle it

A held-out test suite written independently of the issue text on which most accepted patches from current systems fail.

Figures

Figures reproduced from arXiv: 2511.16858 by Avraham Shinnar, Jatin Ganhotra, Martin Hirzel, Toufique Ahmed.

**Figure 2.** Overview of test-based code refinement

**Figure 3.** How severe is test overfitting? (Claude-3.7-Sonnet)

**Figure 4.** Coverage of the unbiased and overfitted patches.

Original abstract

Tests can be useful towards resolving issues on code repositories. However, relying too much on tests for issue resolution can lead to code that technically passes observed tests but actually misses important cases or even breaks functionality. This problem, called test overfitting, is exacerbated by the fact that issues usually lack readily executable tests. Instead, several issue resolution systems use tests auto-generated from issues, which may be imperfect. Some systems even iteratively refine code and tests jointly. This paper presents the first empirical study of test overfitting in this setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Test overfitting is a real risk worth flagging for SWE-bench evaluations, but the paper struggles to separate it cleanly from under-specified tests.

The main takeaway is that test overfitting deserves more attention in SWE-bench evaluations, especially when tests come from auto-generation off issue text, but the paper's ability to isolate it from weak test specifications is limited. This work is new in targeting the SWE-bench setup specifically. It is the first empirical look at how repair systems might overfit to tests that were themselves derived from the problem description rather than from a full test suite. The authors correctly note that real issues often come without executable tests, so generation is a common workaround, and some approaches even co-evolve the tests.

What the paper does well is to frame the risk clearly. Passing auto-generated tests does not guarantee that the patch addresses the underlying issue or avoids breaking other behavior. Highlighting systems that iterate on both code and tests makes the point sharper.

The soft spot is in the evidence for actual overfitting. To claim overfitting, the study needs a way to check correctness independently of the observed tests. If they rely only on the generated tests or on the original issue text, it becomes difficult to tell whether a passing patch is overfitting or whether the tests never captured the full requirement. The concern in the stress-test note holds up: without an oracle or additional validation, the measurement stays entangled with under-specified tests. Overall, the central argument about the risk is reasonable, but the quantification may need stronger grounding.

This paper is aimed at people working on program repair benchmarks and AI coding agents. Anyone evaluating tools on SWE-bench or similar would benefit from reading it as a caution. It is not a definitive measurement, but it opens the question. I would bring this to a reading group for discussion on benchmark reliability. It deserves peer review because the topic is relevant and the empirical intent is clear, even if revisions will be needed on the methodology details.

Referee Report

2 major / 2 minor

Summary. The paper presents the first empirical study of test overfitting in issue-resolution systems evaluated on SWE-bench. It observes that issues typically lack executable tests, that systems therefore rely on auto-generated tests which may be imperfect, and that some systems jointly refine code and tests; the study aims to measure how often patches pass the observed (generated) tests yet fail to address the intended behavior.

Significance. If the empirical measurements are robust, the work would be useful for the SWE-bench and automated program repair communities by highlighting a possible mismatch between benchmark success and actual correctness. It could motivate the inclusion of held-out oracles or differential testing in future evaluations.

major comments (2)

[Methodology / Experimental Setup] The central measurement of test overfitting requires an independent way to determine whether a patch that passes the auto-generated tests is actually correct. The manuscript does not describe any held-out test suite, manual correctness labels, or coverage-based differential analysis that would separate overfitting from under-specified tests; without such an oracle the reported rates of overfitting rest on an unverified premise (see skeptic note and abstract).
[Related Work / Introduction] The claim that the study is the 'first empirical study' of test overfitting in this setting is load-bearing for the contribution. The paper should explicitly compare its methodology and findings against prior work on test overfitting in APR or on SWE-bench variants that already use generated tests, to substantiate novelty.

minor comments (2)

[Definitions] Clarify the exact definition of 'test overfitting' used in the quantitative results (e.g., does it mean that a patch passes the generated tests but fails a human-written test, or that it fails to exhibit a manually inspected intended behavior?).
[Experimental Setup] Provide the full list of SWE-bench tasks, models, and test-generation methods examined so that the scope of the empirical study is reproducible.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have revised the paper to strengthen the presentation of our methodology and to better substantiate the novelty of the work.

Referee: [Methodology / Experimental Setup] The central measurement of test overfitting requires an independent way to determine whether a patch that passes the auto-generated tests is actually correct. The manuscript does not describe any held-out test suite, manual correctness labels, or coverage-based differential analysis that would separate overfitting from under-specified tests; without such an oracle the reported rates of overfitting rest on an unverified premise (see skeptic note and abstract).

Authors: We agree that an explicit account of how actual correctness is assessed is necessary to support the reported overfitting rates. Our evaluation uses the SWE-bench test harness (which executes the repository's own test suites) as the primary indicator of whether a patch resolves the underlying issue beyond the auto-generated tests. In addition, we conducted coverage-based differential analysis and manual inspection on a representative sample of patches to distinguish true fixes from overfitting cases. We will revise the Methodology section to include a dedicated subsection that fully describes this oracle process, the sampling procedure, and how we mitigate the risk of conflating under-specified tests with overfitting. This change will make the empirical premise transparent and address the concern directly.

revision: yes

Referee: [Related Work / Introduction] The claim that the study is the 'first empirical study' of test overfitting in this setting is load-bearing for the contribution. The paper should explicitly compare its methodology and findings against prior work on test overfitting in APR or on SWE-bench variants that already use generated tests, to substantiate novelty.

Authors: We appreciate the referee's point that the novelty claim requires explicit positioning. While our work is the first to focus specifically on test overfitting arising from auto-generated tests in LLM-based issue-resolution systems evaluated on SWE-bench, we acknowledge related investigations of overfitting in classical APR. We will expand the Related Work and Introduction sections to include a systematic comparison with prior APR studies on test overfitting (e.g., those examining overfitting in search-based and semantic repair) and with any SWE-bench variants that rely on generated tests. The revised text will highlight the methodological distinctions (our emphasis on natural-language issue descriptions, joint code-and-test refinement, and large-scale empirical measurement across multiple systems) to substantiate why the present study constitutes the first targeted investigation in this setting.

revision: yes

Circularity Check

0 steps flagged

Empirical analysis on external benchmark with no self-referential derivations

This is an empirical study that applies existing methods to the external SWE-bench benchmark and reports observations about test overfitting. No equations, fitted parameters, or first-principles derivations are present in the provided abstract or description. The central claims rest on measurements against the benchmark rather than any quantity defined in terms of itself or justified solely by self-citation chains. The paper is therefore self-contained against external data with no reduction of results to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the nature of software issues and test generation rather than new free parameters or invented entities.

axioms (2)

domain assumption: Issues usually lack readily executable tests.
Directly stated in the abstract as a fact that exacerbates test overfitting.

domain assumption: Tests auto-generated from issues may be imperfect.
Assumed in the abstract to contribute to code that passes tests but misses cases or breaks functionality.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

This paper presents the first empirical study of test overfitting in this setting... test-based code refinement loop... Reward(c_old, c_new, t_gen) = 1/3 isFail + 1/3 isPass + 1/3 coverage
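The reward quoted in the passage above can be read as an equal-weight average of three terms. The sketch below is one possible reading, not the paper's implementation: it assumes `isFail` means the generated test fails on the old code, `isPass` means it passes on the new code, and `coverage` is a fraction in [0, 1]. The quoted passage does not define these terms, and the parameter names are made up.

```python
def reward(fails_on_old: bool, passes_on_new: bool, coverage: float) -> float:
    """Equal-weight reward: 1/3 isFail + 1/3 isPass + 1/3 coverage.

    Assumed meanings (not defined in the quoted passage):
      fails_on_old  -- the generated test fails on the old code
      passes_on_new -- the generated test passes on the new code
      coverage      -- coverage fraction in [0, 1]
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be a fraction between 0 and 1")
    return (float(fails_on_old) + float(passes_on_new) + coverage) / 3.0

print(reward(True, True, 1.0))
print(reward(True, False, 0.5))
```

Under this reading, a test that fails before the patch, passes after it, and fully covers the change scores 1.0, and each missing property removes up to one third.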

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
