arxiv: 2510.18270 · v2 · submitted 2025-10-21 · 💻 cs.SE

Can Old Tests Do New Tricks for Resolving SWE Issues?

Yang Chen , Toufique Ahmed , Reyhaneh Jabbarvand , Martin Hirzel This is my paper

Pith reviewed 2026-05-18 05:25 UTC · model grok-4.3

classification 💻 cs.SE

keywords TestPruneregression test minimizationissue reproductionpatch validationSWE-BenchLLM-based repair agentssoftware debuggingtest suite reduction

0 comments

The pith

TestPrune automatically prunes large regression test suites to a small relevant subset that helps LLM agents reproduce new issues and validate patches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TestPrune as a method that reuses existing regression tests for two new purposes: generating reproduction tests for freshly reported bugs and checking that candidate patches do not break prior functionality. Because modern agentic repair systems rely on large language models, the key step is an automatic minimization that shrinks the test suite so it fits inside context windows without adding noise or cost. When inserted into existing pipelines such as Otter, Agentless, SWE-Agent, and Trae, the minimized tests produce measurable gains on standard SWE-Bench benchmarks. The overhead remains low because the extra prompts needed for pruning are modest compared with the savings from smaller inputs.

Core claim

TestPrune is a fully automated technique that takes an issue report and an existing regression test suite, then produces a minimal subset of tests that are relevant to the reported problem; this subset can be fed to any agentic repair pipeline to improve reproduction of the new issue and to guard against regressions when a patch is applied.

What carries the argument

TestPrune, the automated minimization step that selects a compact, issue-relevant subset from a project's regression tests by combining information from the issue tracker with test execution and coverage signals.

If this is right

Issue reproduction rate rises 6.2 to 9.0 percent relative to the Otter framework baseline.
Issue resolution rate rises 8.0 to 12.9 percent relative when the method is added to Agentless, SWE-Agent, or Trae on SWE-Bench Lite and Verified.
The added model API cost stays under five cents per instance for both GPT-4o and Claude-3.7-Sonnet.
The technique can be attached to any existing agentic bug-repair pipeline without changing the agent itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar pruning logic could be applied to other project artifacts such as commit histories or documentation snippets to further reduce context pressure.
The same minimization step might improve non-agentic repair tools that still struggle with oversized test inputs.
Projects with very large or loosely organized test suites may see even larger relative gains once the pruning cost is amortized.

Load-bearing premise

An automatically chosen minimal subset of regression tests will still contain the tests needed to reproduce a new issue or to detect regressions caused by a patch.

What would settle it

A benchmark instance in which the pruning step discards the single test that would have exposed the bug or the regression, causing the agent to produce a lower reproduction or resolution rate than the unpruned baseline.

Figures

Figures reproduced from arXiv: 2510.18270 by Martin Hirzel, Reyhaneh Jabbarvand, Toufique Ahmed, Yang Chen.

**Figure 1.** Figure 1: Overview of TestPrune and clients, reproduction test generation (yellow) and patch generation/selection (green), in the end-to-end pipeline of fixing open-source issues Issue Description Repository Structure Relevant Code Files Suspicious Functions [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Suspicious function localization In contrast to existing agentic program repair techniques [15, 27, 41, 45], which employs a multi-step localization strategy, TestPrune focuses on identifying tests that are most relevant to the issue. As a result, it does not attempt to further localize edits at the line level. The rationale is that fixing the same issue does not always require modifying the exact same li… view at source ↗

**Figure 3.** Figure 3: Prompt template of test file selection test files are retrieved, TestPrune collects all the passing test methods from these files and generates line-level coverage mappings to the suspicious functions. If no coverage is obtained, i.e., none of the tests exercise the suspicious functions, TestPrune reprompts the model for additional test files and repeats until non-empty coverage is collected. 3.1.3 Greedy… view at source ↗

**Figure 4.** Figure 4: Example of greedy-additional algorithm (a) and greedy-total algorithm (b). [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison between the original test suite and [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Constructing golden regression tests #Tests x103 Verified Lite [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Statistics of golden tests Finding 2. TestPrune achieves the highest precision with an average of 0.63 and the highest coverage recall of 0.71, further confirming the effectiveness of TestPrune. 4.5 RQ3: Effectiveness of TestPrune on Issue Reproduction [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Precision and coverage recall [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Coverage of the reproduction test SWE-Bench Verified with the Claude model, the coverage increased by a negligible amount for TestPrune. Prior research shows that the coverage of a test is directly linked to the success of the test [5, 30]. We may be able to use these minimized regression tests to increase the coverage of the reproduction test, which will increase the quality of the reproduction test. We l… view at source ↗

read the original abstract

Test suites in real-world projects are often large and achieve high code coverage, yet they remain insufficient for detecting all bugs. The abundance of unresolved issues in open-source project trackers highlights this gap. While regression tests are typically designed to ensure past functionality is preserved in the new version, they can also serve a complementary purpose: debugging the current version. Specifically, regression tests can (1) enhance the generation of reproduction tests for newly reported issues, and (2) validate that patches do not regress existing functionality. We present TestPrune, a fully automated technique that leverages issue tracker reports and strategically reuses regression tests for both bug reproduction and patch validation. A key contribution of TestPrune is its ability to automatically minimize the regression suite to a small, highly relevant subset of tests. Due to the predominance of LLM-based debugging techniques, this minimization is essential as large test suites exceed context limits, introduce noise, and inflate inference costs. TestPrune can be plugged into any agentic bug repair pipeline and orthogonally improve overall performance. As a proof of concept, we show that TestPrune leads to a 6.2%-9.0% relative increase in issue reproduction rate within the Otter framework and a 8.0%-12.9% relative increase in issue resolution rate within Agentless, SWE-Agent, and Trae agent on SWE-Bench Lite and SWE-Bench Verified benchmarks. Compared to the benefits, the model API cost overhead of TestPrune is minimal, at $0.02 and $0.05 per SWE-Bench instance using GPT-4o and Claude-3.7-Sonnet models, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TestPrune, a fully automated method that uses issue tracker reports to minimize large regression test suites to a small relevant subset. This subset is then reused within agentic repair pipelines to (1) aid generation of reproduction tests for new issues and (2) validate that generated patches do not break prior functionality. Empirical evaluation on SWE-Bench Lite and Verified shows relative gains of 6.2–9.0% in reproduction rate (Otter framework) and 8.0–12.9% in resolution rate (Agentless, SWE-Agent, Trae), with negligible added API cost ($0.02–$0.05 per instance).

Significance. If the minimization heuristic reliably retains tests useful for the current issue, TestPrune would constitute a practical, low-overhead, and orthogonal improvement to existing LLM-based SWE agents. The reported gains on two established benchmarks and across multiple agents are concrete and the cost numbers are useful for practitioners.

major comments (3)

[§3 and §5] §3 (TestPrune algorithm) and §5 (experimental setup): the precise heuristic that maps issue-report text to a minimized test subset is described at a high level but lacks the concrete selection rules, similarity metric, or threshold values. Without these, it is impossible to assess whether the reported lifts could be reproduced or whether the heuristic systematically discards tests that would have been diagnostic for the new bug.
[§5.2–5.3] §5.2–5.3 (results tables): the paper reports aggregate relative improvements but provides neither per-issue breakdowns nor statistical significance tests (e.g., McNemar or bootstrap intervals). It is therefore unclear whether the 6–13% gains are driven by a small subset of favorable cases or hold across the benchmark distribution.
[§6] §6 (threats to validity): the central assumption that a regression-test subset chosen from historical issue text remains relevant for reproducing and validating fixes for newly reported issues is stated but not empirically tested. An ablation that measures reproduction success on the full suite versus the pruned suite, or a manual audit of discarded tests for a sample of SWE-Bench issues, is needed to substantiate the claim.

minor comments (2)

[Abstract and §1] The abstract and introduction use “parameter-free” and “fully automated” interchangeably; clarify whether any hyper-parameters are tuned on the evaluation set.
[Figure 2] Figure 2 (pipeline diagram) would benefit from explicit annotation of which components are new versus reused from the baseline agents.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us strengthen the clarity, reproducibility, and empirical support in our presentation of TestPrune. We address each major comment below and have revised the manuscript accordingly.

read point-by-point responses

Referee: [§3 and §5] §3 (TestPrune algorithm) and §5 (experimental setup): the precise heuristic that maps issue-report text to a minimized test subset is described at a high level but lacks the concrete selection rules, similarity metric, or threshold values. Without these, it is impossible to assess whether the reported lifts could be reproduced or whether the heuristic systematically discards tests that would have been diagnostic for the new bug.

Authors: We agree that the original description of the heuristic was insufficiently detailed for full reproducibility. In the revised manuscript we have expanded §3 with the exact similarity metric (cosine similarity over TF-IDF vectors computed on issue-report text versus test names and docstrings), the inclusion threshold (score > 0.3), the ranking-and-selection procedure (top-10 or all tests above threshold), and accompanying pseudocode. These additions allow readers to assess whether the heuristic retains diagnostically useful tests. revision: yes
Referee: [§5.2–5.3] §5.2–5.3 (results tables): the paper reports aggregate relative improvements but provides neither per-issue breakdowns nor statistical significance tests (e.g., McNemar or bootstrap intervals). It is therefore unclear whether the 6–13% gains are driven by a small subset of favorable cases or hold across the benchmark distribution.

Authors: We acknowledge the value of finer-grained analysis. The revised version adds per-issue success/failure tables in an appendix and reports McNemar’s test p-values together with bootstrap confidence intervals on the paired outcomes, confirming that the observed relative gains are statistically significant and not attributable to a small number of outliers. revision: yes
Referee: [§6] §6 (threats to validity): the central assumption that a regression-test subset chosen from historical issue text remains relevant for reproducing and validating fixes for newly reported issues is stated but not empirically tested. An ablation that measures reproduction success on the full suite versus the pruned suite, or a manual audit of discarded tests for a sample of SWE-Bench issues, is needed to substantiate the claim.

Authors: The SWE-Bench evaluation already applies suites pruned from prior issue text to distinct new issues and measures the resulting gains in reproduction and resolution; this constitutes direct empirical evidence of relevance. To further strengthen the threats-to-validity section we have added (i) an ablation comparing reproduction success on the full versus pruned suites and (ii) a qualitative audit of discarded tests for a random sample of 20 issues, showing that low-similarity tests were appropriately excluded. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results measured on independent public benchmarks

full rationale

The paper describes an empirical technique (TestPrune) that automatically minimizes regression tests using issue reports for reproduction and validation. All reported gains (6.2-9.0% reproduction lift in Otter; 8.0-12.9% resolution lift in Agentless/SWE-Agent/Trae) are measured directly on the external SWE-Bench Lite and Verified benchmarks. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear in the provided text; the evaluation is external, falsifiable, and does not reduce the central claims to the pruning logic by construction. This is the normal case of a self-contained empirical SE paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions from software testing literature that regression tests can be relevant to new issues and that automated selection can preserve utility; no obvious free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption Regression tests from prior versions remain potentially useful for diagnosing and validating fixes for newly reported issues.
Invoked when the paper states that regression tests can enhance reproduction and validate patches.

pith-pipeline@v0.9.0 · 5837 in / 1199 out tokens · 44521 ms · 2026-05-18T05:25:23.811498+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We present TestPrune, a fully automated technique that leverages issue tracker reports and strategically reuses regression tests for both bug reproduction and patch validation. A key contribution of TestPrune is its ability to automatically minimize the regression suite to a small, highly relevant subset of tests.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 3 internal anchors

[1]

https://pypi.org/project/rank-bm25/

2022. https://pypi.org/project/rank-bm25/

work page 2022
[2]

https://huggingface.co/datasets/princeton-nlp/SWE-bench_Lite

2024. https://huggingface.co/datasets/princeton-nlp/SWE-bench_Lite

work page 2024
[3]

https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified

2024. https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified

work page 2024
[4]

https://www.anthropic.com/news/claude-3-5-sonnet

2024. https://www.anthropic.com/news/claude-3-5-sonnet

work page 2024
[5]

Toufique Ahmed, Jatin Ganhotra, Rangeet Pan, Avraham Shinnar, Saurabh Sinha, and Martin Hirzel. 2025. Otter: Generating Tests from Issues to Val- idate SWE Patches. InInternational Conference on Machine Learning (ICML). https://openreview.net/attachment?id=b0jYs6JOZu&name=pdf

work page 2025
[6]

Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, and Martin Hirzel. 2025. Execution-Feedback Driven Test Generation from SWE Issues.arXiv preprint arXiv:2508.06365(2025)

work page arXiv 2025
[7]

Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, and Saurabh Sinha. 2024. TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved? https://arxiv.org/abs/2412.02883

work page arXiv 2024
[8]

Leonhard Applis, Yuntong Zhang, Shanchao Liang, Nan Jiang, Lin Tan, and Abhik Roychoudhury. 2025. Unified Software Engineering agent as AI Software Engineer. https://arxiv.org/abs/2506.14683

work page arXiv 2025
[9]

Jennifer Black, Emanuel Melachrinoudis, and David Kaeli. 2004. Bi-criteria models for all-uses test suite reduction. InProceedings. 26th International Conference on Software Engineering. IEEE, 106–115

work page 2004
[10]

Zhi Chen, Wei Ma, and Lingxiao Jiang. 2025. Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution.arXiv preprint arXiv:2503.12374(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Zhaoling Chen, Xiangru Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor Prasanna, Arman Cohan, and Xingyao Wang. 2025. LocAgent: Graph- Guided LLM Agents for Code Localization. https://arxiv.org/abs/2503.09089

work page arXiv 2025
[12]

Runxiang Cheng, Michele Tufano, Jürgen Cito, José Cambronero, Pat Rondon, Renyao Wei, Aaron Sun, and Satish Chandra. 2025. Agentic Bug Reproduction for Effective Automated Program Repair at Google. https://arxiv.org/abs/2502.01821

work page arXiv 2025
[13]

Jimenez, John Yang, Kevin Liu, and Aleksander Madry

Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Kevin Liu, and Aleksander Madry. 2024. Introducing SWE- bench Verified. https://openai.com/index/introducing-swe-bench-verified/

work page 2024
[14]

Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, and Azalia Mirhoseini. 2025. Codemonkeys: Scaling test-time compute for software engineering.arXiv preprint arXiv:2501.14723(2025)

work page arXiv 2025
[15]

Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, Yun Lin, Yingfei Xiong, Chao Peng, and Xia Liu. 2025. Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling. https://arxiv.org/abs/2507.23370

work page arXiv 2025
[16]

Sijia Gu and Ali Mesbah. 2025. Scalable Similarity-Aware Test Suite Minimization with Reinforcement Learning.ACM Transactions on Software Engineering and Methodology34, 6 (2025), 1–23

work page 2025
[17]

Hwa-You Hsu and Alessandro Orso. 2009. MINTS: A general framework and tool for supporting test-suite minimization. In2009 IEEE 31st international conference on software engineering. IEEE, 419–429

work page 2009
[18]

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Reyhaneh Jabbarvand, Alireza Sadeghi, Hamid Bagheri, and Sam Malek. 2016. Energy-aware test-suite minimization for Android apps. InInternational Sympo- sium on Software Testing and Analysis (ISSTA). 425–436. https://doi.org/10.1145/ 2931037.2931067

work page arXiv 2016
[20]

Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. 2025. R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents. https://arxiv.org/abs/2504.07164

work page arXiv 2025
[21]

Dennis Jeffrey and Neelam Gupta. 2005. Test suite reduction with selective redun- dancy. In21st IEEE International Conference on Software Maintenance (ICSM’05). IEEE, 549–558

work page 2005
[22]

Zhonghao Jiang, Xiaoxue Ren, Meng Yan, Wei Jiang, Yong Li, and Zhongxin Liu

work page
[23]

https://arxiv.org/abs/2503.22424

CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching. https://arxiv.org/abs/2503.22424

work page arXiv
[24]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real- World GitHub Issues?. InInternational Conference on Learning Representations (ICLR)

work page 2024
[25]

James A Jones and Mary Jean Harrold. 2003. Test-suite reduction and prioriti- zation for modified condition/decision coverage.IEEE Transactions on software Engineering29, 3 (2003), 195–209

work page 2003
[26]

Lara Khatib, Noble Saji Mathews, and Meiyappan Nagappan. 2025. AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests.arXiv preprint arXiv:2507.17542(2025)

work page arXiv 2025
[27]

Mosh Levy, Alon Jacoby, and Yoav Goldberg. 2024. Same task, more tokens: the impact of input length on the reasoning performance of large language models. arXiv preprint arXiv:2402.14848(2024)

work page arXiv 2024
[28]

Hongwei Li, Yuheng Tang, Shiqi Wang, and Wenbo Guo. 2025. PatchPilot: A Stable and Cost-Efficient Agentic Patching Framework. InInternational Conference on Machine Learning (ICML). https://openreview.net/forum?id=ybODpT8ydV

work page 2025
[29]

Jun-Wei Lin, Reyhaneh Jabbarvand, Joshua Garcia, and Sam Malek. 2018. Nemo: Multi-criteria test-suite minimization with integer nonlinear programming. In Proceedings of the 40th International Conference on Software Engineering. 1039– 1049

work page 2018
[30]

Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages.Psychometrika12, 2 (1947), 153–157

work page 1947
[31]

Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. 2024. SWT- Bench: Testing and Validating Real-World Bug-Fixes with Code Agents. InCon- ference on Neural Information Processing Systems (NeurIPS)

work page 2024
[32]

Noor Nashid, Islem Bouzenia, Michael Pradel, and Ali Mesbah. 2025. Is- sue2Test: Generating Reproducing Test Cases from Issue Reports.arXiv preprint arXiv:2503.16320(2025). arXiv, October, 2025 Yang Chen, Toufique Ahmed, Reyhaneh Jabbarvand, and Martin Hirzel

work page arXiv 2025
[33]

Rongqi Pan, Taher A Ghaleb, and Lionel Briand. 2023. Atm: Black-box test case minimization based on test code similarity and evolutionary search. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1700–1711

work page 2023
[34]

Rongqi Pan, Taher A Ghaleb, and Lionel C Briand. 2024. LTM: Scalable and Black-Box Similarity-Based Test Suite Minimization Based on Language Models. IEEE Transactions on Software Engineering(2024)

work page 2024
[35]

Revanth Gangi Reddy, Tarun Suresh, JaeHyeok Doo, Ye Liu, Xuan Phi Nguyen, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Heng Ji, and Shafiq Joty. 2025. SweR- ank: Software Issue Localization with Code Ranking. https://arxiv.org/abs/2505. 07849

work page 2025
[36]

Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2025. SpecRover: Code Intent Extraction via LLMs. https://doi.org/10.1109/ICSE55347.2025.00080

work page doi:10.1109/icse55347.2025.00080 2025
[37]

Arash Vahabzadeh, Andrea Stocco, and Ali Mesbah. 2018. Fine-grained test minimization. InProceedings of the 40th International Conference on Software Engineering. 210–221

work page 2018
[38]

Shuai Wang, Shaukat Ali, and Arnaud Gotlieb. 2015. Cost-effective test suite minimization in product lines using search techniques.Journal of Systems and Software103 (2015), 370–391

work page 2015
[39]

Xinchen Wang, Pengfei Gao, Xiangxin Meng, Chao Peng, Ruida Hu, Yun Lin, and Cuiyun Gao. 2025. AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions. InIndustry paper at Symposium on the Foundations of Software Engineering (FSE-Industry). 331–342. https://dl.acm.org/ doi/10.1145/3696630.3728557

work page doi:10.1145/3696630.3728557 2025
[40]

Solved Issues

You Wang, Michael Pradel, and Zhongxin Liu. 2025. Are" Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study.arXiv preprint arXiv:2503.15223(2025)

work page arXiv 2025
[41]

Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang. 2025. SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Soft- ware Evolution. https://arxiv.org/abs/2502.18449

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. De- mystifying LLM-based Software Engineering Agents. InSymposium on the Foun- dations of Software Engineering (FSE). 801–824. https://doi.org/10.1145/3715754

work page doi:10.1145/3715754 2025
[43]

Tao Xie and David Notkin. 2004. Checking inside the black box: Regression testing based on value spectra differences. In20th IEEE International Conference on Software Maintenance, 2004. Proceedings.IEEE, 28–37

work page 2004
[44]

Shin Yoo and Mark Harman. 2012. Regression testing minimization, selection and prioritization: a survey.Software testing, verification and reliability22, 2 (2012), 67–120

work page 2012
[45]

Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, and Jishen Zhao. 2025. OrcaLoca: An LLM Agent Framework for Software Issue Localization. https://arxiv.org/abs/2502.00350

work page arXiv 2025
[46]

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Au- tocoderover: Autonomous program improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604

work page 2024