pith. the verified trust layer for science. sign in

arxiv: 2511.16858 · v3 · submitted 2025-11-20 · 💻 cs.SE · cs.LG

Investigating Test Overfitting on SWE-bench

Pith reviewed 2026-05-17 19:53 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords test overfittingSWE-benchissue resolutioncode repairauto-generated testsempirical studysoftware engineering
0
0 comments X p. Extension

The pith

Systems resolving software issues overfit to imperfect auto-generated tests on SWE-bench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper conducts the first empirical study of test overfitting in issue resolution systems that rely on tests auto-generated from natural language issue descriptions. These systems often lack access to ready-made executable tests, so they create their own from the issue text and then use those tests to guide code changes. When the generated tests are incomplete or miss key cases, the resulting fixes can pass the tests while leaving the original problem unsolved or creating new failures. The study examines this dynamic on the SWE-bench benchmark to quantify how often overfitting occurs. Readers should care because many current AI-driven repair tools follow this pattern and may therefore report inflated success rates.

Core claim

Test overfitting arises when resolution systems iteratively refine code against tests that were themselves derived from issue descriptions; the paper shows through experiments on SWE-bench that this process frequently yields patches passing the generated tests without addressing the underlying defect or preserving unrelated functionality.

What carries the argument

Measurement of how often generated tests accept incorrect or incomplete patches on SWE-bench tasks.

If this is right

  • Patches accepted by current systems may pass generated tests yet still contain bugs or miss the intended fix.
  • Iterative joint refinement of code and tests can lock in the overfitting.
  • Benchmark scores on SWE-bench can overstate real-world resolution quality.
  • Resolution pipelines need additional checks beyond the initial generated test set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks would be more reliable if they supplied separate verification tests not derived from the issue description.
  • Systems could reduce overfitting by adding static analysis or execution on diverse inputs after the generated tests pass.
  • The same risk likely exists in other AI coding settings that start from incomplete natural-language specifications.

Load-bearing premise

The assumption that tests auto-generated from issue text are imperfect enough to let overfitting produce solutions that are meaningfully wrong rather than just slightly incomplete.

What would settle it

A held-out test suite written independently of the issue text on which most accepted patches from current systems fail.

Figures

Figures reproduced from arXiv: 2511.16858 by Avraham Shinnar, Jatin Ganhotra, Martin Hirzel, Toufique Ahmed.

Figure 1
Figure 1. Figure 1: Test overfitting example on django__django-12308. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of test-based code refinement [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: How severe is test overfitting? (Claude-3.7-Sonnet) [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Coverage of the unbiased and overfitted patches. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
read the original abstract

Tests can be useful towards resolving issues on code repositories. However, relying too much on tests for issue resolution can lead to code that technically passes observed tests but actually misses important cases or even breaks functionality. This problem, called test overfitting, is exacerbated by the fact that issues usually lack readily executable tests. Instead, several issue resolution systems use tests auto-generated from issues, which may be imperfect. Some systems even iteratively refine code and tests jointly. This paper presents the first empirical study of test overfitting in this setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the first empirical study of test overfitting in issue-resolution systems evaluated on SWE-bench. It observes that issues typically lack executable tests, that systems therefore rely on auto-generated tests which may be imperfect, and that some systems jointly refine code and tests; the study aims to measure how often patches pass the observed (generated) tests yet fail to address the intended behavior.

Significance. If the empirical measurements are robust, the work would be useful for the SWE-bench and automated program repair communities by highlighting a possible mismatch between benchmark success and actual correctness. It could motivate the inclusion of held-out oracles or differential testing in future evaluations.

major comments (2)
  1. [Methodology / Experimental Setup] The central measurement of test overfitting requires an independent way to determine whether a patch that passes the auto-generated tests is actually correct. The manuscript does not describe any held-out test suite, manual correctness labels, or coverage-based differential analysis that would separate overfitting from under-specified tests; without such an oracle the reported rates of overfitting rest on an unverified premise (see skeptic note and abstract).
  2. [Related Work / Introduction] The claim that the study is the 'first empirical study' of test overfitting in this setting is load-bearing for the contribution. The paper should explicitly compare its methodology and findings against prior work on test overfitting in APR or on SWE-bench variants that already use generated tests, to substantiate novelty.
minor comments (2)
  1. [Definitions] Clarify the exact definition of 'test overfitting' used in the quantitative results (e.g., is it patch passes generated tests but fails on a human-written test, or fails on a manually inspected intended behavior?).
  2. [Experimental Setup] Provide the full list of SWE-bench tasks, models, and test-generation methods examined so that the scope of the empirical study is reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have revised the paper to strengthen the presentation of our methodology and to better substantiate the novelty of the work.

read point-by-point responses
  1. Referee: [Methodology / Experimental Setup] The central measurement of test overfitting requires an independent way to determine whether a patch that passes the auto-generated tests is actually correct. The manuscript does not describe any held-out test suite, manual correctness labels, or coverage-based differential analysis that would separate overfitting from under-specified tests; without such an oracle the reported rates of overfitting rest on an unverified premise (see skeptic note and abstract).

    Authors: We agree that an explicit account of how actual correctness is assessed is necessary to support the reported overfitting rates. Our evaluation uses the SWE-bench test harness (which executes the repository's own test suites) as the primary indicator of whether a patch resolves the underlying issue beyond the auto-generated tests. In addition, we conducted coverage-based differential analysis and manual inspection on a representative sample of patches to distinguish true fixes from overfitting cases. We will revise the Methodology section to include a dedicated subsection that fully describes this oracle process, the sampling procedure, and how we mitigate the risk of conflating under-specified tests with overfitting. This change will make the empirical premise transparent and address the concern directly. revision: yes

  2. Referee: [Related Work / Introduction] The claim that the study is the 'first empirical study' of test overfitting in this setting is load-bearing for the contribution. The paper should explicitly compare its methodology and findings against prior work on test overfitting in APR or on SWE-bench variants that already use generated tests, to substantiate novelty.

    Authors: We appreciate the referee's point that the novelty claim requires explicit positioning. While our work is the first to focus specifically on test overfitting arising from auto-generated tests in LLM-based issue-resolution systems evaluated on SWE-bench, we acknowledge related investigations of overfitting in classical APR. We will expand the Related Work and Introduction sections to include a systematic comparison with prior APR studies on test overfitting (e.g., those examining overfitting in search-based and semantic repair) and with any SWE-bench variants that rely on generated tests. The revised text will highlight the methodological distinctions—our emphasis on natural-language issue descriptions, joint code-and-test refinement, and large-scale empirical measurement across multiple systems—to substantiate why the present study constitutes the first targeted investigation in this setting. revision: yes

Circularity Check

0 steps flagged

Empirical analysis on external benchmark with no self-referential derivations

full rationale

This is an empirical study that applies existing methods to the external SWE-bench benchmark and reports observations about test overfitting. No equations, fitted parameters, or first-principles derivations are present in the provided abstract or description. The central claims rest on measurements against the benchmark rather than any quantity defined in terms of itself or justified solely by self-citation chains. The paper is therefore self-contained against external data with no reduction of results to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the nature of software issues and test generation rather than new free parameters or invented entities.

axioms (2)
  • domain assumption Issues usually lack readily executable tests.
    Directly stated in the abstract as a fact that exacerbates test overfitting.
  • domain assumption Tests auto-generated from issues may be imperfect.
    Assumed in the abstract to contribute to code that passes tests but misses cases or breaks functionality.

pith-pipeline@v0.9.0 · 5379 in / 1161 out tokens · 60544 ms · 2026-05-17T19:53:19.274157+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

  1. [1]

    Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, and Martin Hirzel. 2025. Execution-Feedback Driven Test Generation from SWE Issues. https://arxiv.org/ abs/2508.06365

  2. [2]

    Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, and Saurabh Sinha. 2024. TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved? https://arxiv.org/abs/2412.02883

  3. [3]

    Daman Arora, Atharv Sonwane, Nalin Wadhwa, Abhav Mehrotra, Saiteja Utpala, Ramakrishna Bairi, Aditya Kanade, and Nagarajan Natarajan. 2024. MASAI: Modular Architecture for Software-engineering AI Agents. https://arxiv.org/ abs/2406.11638

  4. [4]

    Y.; Madry, A.; Zaremba, W.; Pachocki, J.; and Farhi, D

    Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. 2025. Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. https://arxiv.org/abs/2503.11926

  5. [5]

    Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2023. CodeT: Code Generation with Generated Tests. In International Conference on Learning Representations (ICLR)

  6. [6]

    Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, Jie Wang, Xiao Cheng, Guangtai Liang, Yuchi Ma, Pan Bian, Tao Xie, and Qianxiang Wang. 2024. CodeR: Issue Resolving with Multi-Agent and Task Graphs. https://arxiv.org/abs/2406. 01304

  7. [7]

    Jimenez, John Yang, Kevin Liu, and Aleksander Madry

    Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Kevin Liu, and Aleksander Madry. 2024. Introducing SWE- bench Verified. https://openai.com/index/introducing-swe-bench-verified/

  8. [8]

    Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Re, and Azalia Mirhoseini. 2025. CodeMonkeys: Scaling Test-Time Compute for Software Engineering. https://arxiv.org/abs/2501.14723

  9. [9]

    Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, Yun Lin, Yingfei Xiong, Chao Peng, and Xia Liu. 2025. Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling. https://arxiv.org/abs/2507.23370

  10. [10]

    Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. 2025. R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents. https://arxiv.org/abs/2504.07164

  11. [11]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real- World GitHub Issues?. InInternational Conference on Learning Representations (ICLR)

  12. [12]

    Gonzalez, and Ion Stoica

    Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. 2025. S*: Test Time Scaling for Code Generation. https://arxiv.org/abs/2502.14382

  13. [13]

    Hongwei Li, Yuheng Tang, Shiqi Wang, and Wenbo Guo. 2025. PatchPilot: A Stable and Cost-Efficient Agentic Patching Framework. InInternational Conference on Machine Learning (ICML)

  14. [14]

    KeFan Li, Mengfei Wang, Hengzhi Zhang, Zhichao Li, Yuan Yuan, Mu Li, Xiang Gao, Hailong Sun, Chunming Hu, and Weifeng Lv. 2025. InfCode: Adversarial Iterative Refinement of Tests and Patches for Reliable Software Issue Resolution. https://arxiv.org/abs/2511.16004

  15. [15]

    Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. 2024. SWT- Bench: Testing and Validating Real-World Bug-Fixes with Code Agents. InCon- ference on Neural Information Processing Systems (NeurIPS)

  16. [16]

    Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2024. SpecRover: Code Intent Extraction via LLMs. https://arxiv.org/abs/2408.02232

  17. [17]

    Smith, Earl T

    Edward K. Smith, Earl T. Barr, Claire Le Goues, and Yuriy Brun. 2015. Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair. In Symposium on the Foundations of Software Engineering (FSE). https://doi.org/10. 1145/2786805.2786825

  18. [18]

    Benedikt Stroebl, Sayash Kapoor, and Arvind Narayanan. 2024. Inference Scaling FLaws: The Limits of LLM Resampling with Imperfect Verifiers. https://arxiv. org/abs/2411.17501

  19. [19]

    Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, and Mengdi Wang. 2025. Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning. https://arxiv.org/abs/ 2506.03136

  20. [20]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. De- mystifying LLM-based Software Engineering Agents. InSymposium on the Foun- dations of Software Engineering (FSE). 801–824. https://doi.org/10.1145/3715754

  21. [21]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-Agent: Agent-computer Interfaces Enable Automated Software Engineering. InConference on Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/ paper_files/paper/2024/hash/5a7c947568c1b1328ccc5230172e1e7c-Abstra...

  22. [22]

    Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh Murthy, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu, Bo Pang, Yingbo Zhou, Shelby Heinecke, Silvio Savarese, Huan Wang, and Caiming Xiong. 2025. Diversity Empowers In- telligence: Integrating Expertise of Software Engineering Agents. InInternational Conference on Learning Representations (ICLR)

  23. [23]

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Au- toCodeRover: Autonomous Program Improvement. InInternational Symposium on Software Testing and Analysis (ISSTA). 1592–1604. https://doi.org/10.1145/ 3650212.3680384