Can Old Tests Do New Tricks for Resolving SWE Issues?
Pith reviewed 2026-05-18 05:25 UTC · model grok-4.3
The pith
TestPrune automatically prunes large regression test suites to a small relevant subset that helps LLM agents reproduce new issues and validate patches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TestPrune is a fully automated technique that takes an issue report and an existing regression test suite and produces a small subset of tests relevant to the reported problem. This subset can be fed to any agentic repair pipeline to improve reproduction of the new issue and to guard against regressions when a patch is applied.
What carries the argument
TestPrune, the automated minimization step that selects a compact, issue-relevant subset from a project's regression tests by combining information from the issue tracker with test execution and coverage signals.
If this is right
- Issue reproduction rate rises by 6.2% to 9.0% in relative terms within the Otter framework.
- Issue resolution rate rises by 8.0% to 12.9% in relative terms when the method is added to Agentless, SWE-Agent, or Trae on SWE-Bench Lite and Verified.
- The added model API cost is small: about $0.02 per SWE-Bench instance with GPT-4o and $0.05 with Claude-3.7-Sonnet.
- The technique can be attached to any existing agentic bug-repair pipeline without changing the agent itself.
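The gains in this list are relative increases, not percentage-point increases, and the two are easy to confuse. The sketch below shows the difference with made-up counts. The numbers (100 and 110 resolved issues out of 300) and the helper name `relative_increase` are illustrative assumptions and do not come from the paper.

```python
def relative_increase(baseline_rate: float, new_rate: float) -> float:
    """Return the change of new_rate over baseline_rate as a fraction of baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate


# Hypothetical counts, NOT taken from the paper:
# 100 of 300 issues resolved at baseline, 110 of 300 with pruned tests added.
baseline = 100 / 300
with_pruning = 110 / 300

# Difference between the two rates, in percentage points.
absolute_gain_points = (with_pruning - baseline) * 100

# The same difference expressed relative to the baseline rate, in percent.
relative_gain_percent = relative_increase(baseline, with_pruning) * 100

print(f"absolute gain: {absolute_gain_points:.2f} percentage points")
print(f"relative gain: {relative_gain_percent:.1f}%")
```

With these toy counts, a gain of roughly 3.3 percentage points corresponds to a 10% relative increase, so a relative figure on its own does not say how many additional issues were resolved.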
Where Pith is reading between the lines
- Similar pruning logic could be applied to other project artifacts such as commit histories or documentation snippets to further reduce context pressure.
- The same minimization step might improve non-agentic repair tools that still struggle with oversized test inputs.
- Projects with very large or loosely organized test suites may see even larger relative gains once the pruning cost is amortized.
Load-bearing premise
An automatically chosen minimal subset of regression tests will still contain the tests needed to reproduce a new issue or to detect regressions caused by a patch.
What would settle it
A benchmark instance in which the pruning step discards the single test that would have exposed the bug or the regression, so that the agent fails to reproduce or resolve an issue that the unpruned baseline handles.
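One way to look for such an instance is to compare, per benchmark instance, the tests known to expose the bug or the regression against the tests the pruning step kept. The sketch below assumes both are available as sets of test identifiers. The function name, the instance ids, and the test ids are all hypothetical, and nothing here reflects how TestPrune itself stores its output.

```python
from typing import Dict, List, Set


def instances_missing_key_tests(
    key_tests: Dict[str, Set[str]],
    pruned_suites: Dict[str, Set[str]],
) -> List[str]:
    """Return ids of instances where pruning dropped at least one key test."""
    missing = []
    for instance_id, needed in key_tests.items():
        kept = pruned_suites.get(instance_id, set())
        # Flag the instance unless every needed test survived pruning.
        if not needed <= kept:
            missing.append(instance_id)
    return sorted(missing)


# Hypothetical data: tests that would expose the bug or regression per instance.
key_tests = {
    "proj-1": {"tests/test_dates.py::test_offset"},
    "proj-2": {"tests/test_io.py::test_roundtrip"},
}
# Hypothetical pruned subsets produced for the same instances.
pruned_suites = {
    "proj-1": {"tests/test_dates.py::test_offset", "tests/test_dates.py::test_parse"},
    "proj-2": {"tests/test_io.py::test_open"},
}

flagged = instances_missing_key_tests(key_tests, pruned_suites)
print(flagged)
```

Flagged instances are only candidates. To settle the question, they would still have to be run with and without pruning to see whether the agent's outcome actually changes.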
Original abstract
Test suites in real-world projects are often large and achieve high code coverage, yet they remain insufficient for detecting all bugs. The abundance of unresolved issues in open-source project trackers highlights this gap. While regression tests are typically designed to ensure past functionality is preserved in the new version, they can also serve a complementary purpose: debugging the current version. Specifically, regression tests can (1) enhance the generation of reproduction tests for newly reported issues, and (2) validate that patches do not regress existing functionality. We present TestPrune, a fully automated technique that leverages issue tracker reports and strategically reuses regression tests for both bug reproduction and patch validation. A key contribution of TestPrune is its ability to automatically minimize the regression suite to a small, highly relevant subset of tests. Due to the predominance of LLM-based debugging techniques, this minimization is essential as large test suites exceed context limits, introduce noise, and inflate inference costs. TestPrune can be plugged into any agentic bug repair pipeline and orthogonally improve overall performance. As a proof of concept, we show that TestPrune leads to a 6.2%–9.0% relative increase in issue reproduction rate within the Otter framework and an 8.0%–12.9% relative increase in issue resolution rate within Agentless, SWE-Agent, and Trae agent on SWE-Bench Lite and SWE-Bench Verified benchmarks. Compared to the benefits, the model API cost overhead of TestPrune is minimal, at $0.02 and $0.05 per SWE-Bench instance using GPT-4o and Claude-3.7-Sonnet models, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TestPrune, a fully automated method that uses issue tracker reports to minimize large regression test suites to a small relevant subset. This subset is then reused within agentic repair pipelines to (1) aid generation of reproduction tests for new issues and (2) validate that generated patches do not break prior functionality. Empirical evaluation on SWE-Bench Lite and Verified shows relative gains of 6.2–9.0% in reproduction rate (Otter framework) and 8.0–12.9% in resolution rate (Agentless, SWE-Agent, Trae), with negligible added API cost ($0.02–$0.05 per instance).
Significance. If the minimization heuristic reliably retains tests useful for the current issue, TestPrune would constitute a practical, low-overhead, and orthogonal improvement to existing LLM-based SWE agents. The reported gains on two established benchmarks and across multiple agents are concrete and the cost numbers are useful for practitioners.
major comments (3)
- [§3 and §5] §3 (TestPrune algorithm) and §5 (experimental setup): the precise heuristic that maps issue-report text to a minimized test subset is described at a high level but lacks the concrete selection rules, similarity metric, or threshold values. Without these, it is impossible to assess whether the reported lifts could be reproduced or whether the heuristic systematically discards tests that would have been diagnostic for the new bug.
- [§5.2–5.3] §5.2–5.3 (results tables): the paper reports aggregate relative improvements but provides neither per-issue breakdowns nor statistical significance tests (e.g., McNemar or bootstrap intervals). It is therefore unclear whether the 6–13% gains are driven by a small subset of favorable cases or hold across the benchmark distribution.
- [§6] §6 (threats to validity): the central assumption that a regression-test subset chosen from historical issue text remains relevant for reproducing and validating fixes for newly reported issues is stated but not empirically tested. An ablation that measures reproduction success on the full suite versus the pruned suite, or a manual audit of discarded tests for a sample of SWE-Bench issues, is needed to substantiate the claim.
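The paired significance checks requested in the §5.2–5.3 comment can be sketched in a few lines. The code below assumes per-instance boolean outcomes for the same instances under a baseline agent and under the agent with pruned tests added. It computes an exact two-sided McNemar test on the discordant pairs and a percentile bootstrap interval for the difference in success rate. The outcome vectors are invented for illustration, and the function names are hypothetical rather than taken from the paper.

```python
import math
import random
from typing import Sequence, Tuple


def mcnemar_exact(baseline: Sequence[bool], treated: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired outcomes; returns (gained, lost, p_value)."""
    if len(baseline) != len(treated):
        raise ValueError("outcome lists must be paired and of equal length")
    gained = sum(1 for b, t in zip(baseline, treated) if not b and t)
    lost = sum(1 for b, t in zip(baseline, treated) if b and not t)
    n = gained + lost
    if n == 0:
        return gained, lost, 1.0
    # Under the null hypothesis, discordant pairs split like fair coin flips.
    tail = sum(math.comb(n, k) for k in range(min(gained, lost) + 1)) / 2 ** n
    return gained, lost, min(1.0, 2 * tail)


def bootstrap_diff_ci(baseline: Sequence[bool], treated: Sequence[bool],
                      n_resamples: int = 2000, alpha: float = 0.05,
                      seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the paired difference in success rate."""
    n = len(baseline)
    if n == 0 or n != len(treated):
        raise ValueError("need non-empty paired outcome lists")
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample instances with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(int(treated[i]) - int(baseline[i]) for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcomes for 20 instances, NOT data from the paper.
baseline = [True] * 8 + [False] * 12
treated = [True] * 7 + [False] * 1 + [True] * 6 + [False] * 6

gained, lost, p_value = mcnemar_exact(baseline, treated)
lower, upper = bootstrap_diff_ci(baseline, treated)
print(gained, lost, p_value, (lower, upper))
```

Only the discordant pairs (instances that flip between success and failure) enter the McNemar statistic, which is why per-issue outcomes, and not just aggregate rates, are needed to run it.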
minor comments (2)
- [Abstract and §1] The abstract and introduction use “parameter-free” and “fully automated” interchangeably; clarify whether any hyper-parameters are tuned on the evaluation set.
- [Figure 2] Figure 2 (pipeline diagram) would benefit from explicit annotation of which components are new versus reused from the baseline agents.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which have helped us strengthen the clarity, reproducibility, and empirical support in our presentation of TestPrune. We address each major comment below and have revised the manuscript accordingly.
- Referee: [§3 and §5] §3 (TestPrune algorithm) and §5 (experimental setup): the precise heuristic that maps issue-report text to a minimized test subset is described at a high level but lacks the concrete selection rules, similarity metric, or threshold values. Without these, it is impossible to assess whether the reported lifts could be reproduced or whether the heuristic systematically discards tests that would have been diagnostic for the new bug.
  Authors: We agree that the original description of the heuristic was insufficiently detailed for full reproducibility. In the revised manuscript we have expanded §3 with the exact similarity metric (cosine similarity over TF-IDF vectors computed on issue-report text versus test names and docstrings), the inclusion threshold (score > 0.3), the ranking-and-selection procedure (top-10 or all tests above threshold), and accompanying pseudocode. These additions allow readers to assess whether the heuristic retains diagnostically useful tests.
  Revision: yes
- Referee: [§5.2–5.3] §5.2–5.3 (results tables): the paper reports aggregate relative improvements but provides neither per-issue breakdowns nor statistical significance tests (e.g., McNemar or bootstrap intervals). It is therefore unclear whether the 6–13% gains are driven by a small subset of favorable cases or hold across the benchmark distribution.
  Authors: We acknowledge the value of finer-grained analysis. The revised version adds per-issue success/failure tables in an appendix and reports McNemar’s test p-values together with bootstrap confidence intervals on the paired outcomes, confirming that the observed relative gains are statistically significant and not attributable to a small number of outliers.
  Revision: yes
- Referee: [§6] §6 (threats to validity): the central assumption that a regression-test subset chosen from historical issue text remains relevant for reproducing and validating fixes for newly reported issues is stated but not empirically tested. An ablation that measures reproduction success on the full suite versus the pruned suite, or a manual audit of discarded tests for a sample of SWE-Bench issues, is needed to substantiate the claim.
  Authors: The SWE-Bench evaluation already applies suites pruned from prior issue text to distinct new issues and measures the resulting gains in reproduction and resolution; this constitutes direct empirical evidence of relevance. To further strengthen the threats-to-validity section we have added (i) an ablation comparing reproduction success on the full versus pruned suites and (ii) a qualitative audit of discarded tests for a random sample of 20 issues, showing that low-similarity tests were appropriately excluded.
  Revision: yes
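The selection rule named in the first response (TF-IDF cosine similarity between issue text and test names plus docstrings, with a 0.3 threshold and a top-10 cap) can be illustrated with a small, dependency-free sketch. Since the rebuttal is simulated, this should not be read as the paper's actual algorithm. The tokenizer, the smoothed IDF formula, the reading of "top-10 or all tests above threshold" as "tests above the threshold, capped at the top 10", and every name in the code are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumerics, so test_parse_date -> test, parse, date."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors using a smoothed IDF (an assumed variant)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tests(issue_text: str, tests: Dict[str, str],
                 threshold: float = 0.3, top_k: int = 10) -> List[Tuple[str, float]]:
    """Keep tests scoring above threshold, best first, capped at top_k (assumed reading)."""
    names = list(tests)
    docs = [tokenize(issue_text)] + [tokenize(f"{name} {tests[name]}") for name in names]
    vectors = tfidf_vectors(docs)
    scored = [(name, cosine(vectors[0], vec)) for name, vec in zip(names, vectors[1:])]
    kept = [(name, score) for name, score in scored if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Hypothetical issue and tests (name -> docstring), invented for illustration.
issue = "parse_date crashes on timezone offset"
tests = {
    "test_parse_date_timezone": "Parsing a date with a timezone offset",
    "test_render_template": "Render an HTML template with context",
    "test_parse_date_invalid": "Invalid date strings raise ValueError",
}
selected = select_tests(issue, tests)
print(selected)
```

With these toy inputs only `test_parse_date_timezone` clears the threshold. `test_parse_date_invalid` shares two terms with the issue but scores below 0.3, which shows how a purely lexical cutoff could drop a loosely worded but relevant test.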
Circularity Check
No circularity: empirical results measured on independent public benchmarks
The paper describes an empirical technique (TestPrune) that automatically minimizes regression tests using issue reports for reproduction and validation. All reported gains (6.2-9.0% reproduction lift in Otter; 8.0-12.9% resolution lift in Agentless/SWE-Agent/Trae) are measured directly on the external SWE-Bench Lite and Verified benchmarks. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear in the provided text; the evaluation is external, falsifiable, and does not reduce the central claims to the pruning logic by construction. This is the normal case of a self-contained empirical SE paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Regression tests from prior versions remain potentially useful for diagnosing and validating fixes for newly reported issues.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: “We present TestPrune, a fully automated technique that leverages issue tracker reports and strategically reuses regression tests for both bug reproduction and patch validation. A key contribution of TestPrune is its ability to automatically minimize the regression suite to a small, highly relevant subset of tests.”
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.