PR-Aware Automated Unit Test Generation: Challenges and Opportunities

Atakan Akar; Berk \c{C}akar; Eray T\"uz\"un; Vahid Haratian

arxiv: 2605.25285 · v1 · pith:FN35LMVZnew · submitted 2026-05-24 · 💻 cs.SE

PR-Aware Automated Unit Test Generation: Challenges and Opportunities

Vahid Haratian , Atakan Akar , Berk \c{C}akar , Eray T\"uz\"un This is my paper

Pith reviewed 2026-06-29 23:18 UTC · model grok-4.3

classification 💻 cs.SE

keywords pull requestsautomated test generationfail-to-pass testsEvoSuiteGPT-4osoftware evolutionunit testing

0 comments

The pith

Current test generators succeed at capturing pull request changes for only a minority of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how well automated test generation tools perform when the goal is to test the specific changes made in a pull request rather than an entire software unit. It uses the fail-to-pass metric to check if generated tests detect the before-and-after difference in the code. The evaluation reveals that both a search-based tool and a large language model fall short on the majority of real pull requests from open source projects, indicating that standard approaches do not align well with the incremental way software is developed today.

Core claim

The evaluation of EvoSuite and GPT-4o on a set of pull requests shows that EvoSuite produces at least one fail-to-pass test for 36 percent of the PRs while GPT-4o does so for 13 percent. Both tools fail to generate any such tests for 64 percent of the PRs, with GPT-4o additionally facing a 63 percent rate of compilation errors in its outputs.

What carries the argument

The fail-to-pass (F2P) test, defined as a test that fails on the version of the code before the pull request change and passes on the version after the change.

If this is right

EvoSuite outperforms GPT-4o at producing change-specific tests.
Both approaches leave most pull requests without meaningful tests.
Compilation errors severely limit the usefulness of LLM-generated tests.
New methods tailored to incremental changes are needed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agentic approaches to code generation may overcome the current limitations.
Extending the evaluation to additional programming languages or change types could test the generality of the findings.
Alternative metrics for test usefulness might complement the F2P criterion.

Load-bearing premise

The collection of pull requests studied and the choice of fail-to-pass success as the key measure reflect the broader problem of generating tests for typical code changes.

What would settle it

A demonstration that a different generator produces fail-to-pass tests for a majority of the same pull requests would undermine the conclusion that current tools are largely ineffective.

Figures

Figures reproduced from arXiv: 2605.25285 by Atakan Akar, Berk \c{C}akar, Eray T\"uz\"un, Vahid Haratian.

**Figure 3.** Figure 3: Evaluation Results for EvoSuite understand the performance without the influence of these errors, we performed a second round of calculations. In this subsequent analysis, we recalculated the F2P percentage after excluding the error cases from both generators. For a deeper analysis to answer RQ3, we also examine how the number of files in a PR affects the generation of useful (F2P) tests. To do this, we gr… view at source ↗

**Figure 5.** Figure 5: EvoSuite’s PR performance broken down by the number of [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 4.** Figure 4: Evaluation Results for GPT-4o EvoSuite Evaluation Although EvoSuite achieved a high compilability rate of 98%, 62% of the generated tests were not containing F2P for the PRs, and only 36% contained at least one F2P test case (37% if excluding errors). B. RQ2: Is GPT-4o effective at generating F2P tests cases for PRs? The evaluation results for GPT-4o are summarized in [PITH_FULL_IMAGE:figures/full_fig_p00… view at source ↗

**Figure 6.** Figure 6: EvoSuite’s success rate in different projects broken down [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: GPT-4o’s success rate in different projects broken down by [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: GPT-4o’s PR performance broken down by the number of [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

read the original abstract

Automated test generation has a substantial body of work, yet most studies focus on generating tests for complete software units, such as classes, and rely on metrics such as code coverage for assessment. In contrast, modern software development primarily evolves through small, targeted changes introduced in pull requests (PRs). Despite this, the crucial task of generating tests specifically for these PRs has been overlooked, and the performance of state-of-the-art tools for this purpose remains unknown. This study evaluates two distinct approaches for PR-aware test generation: EvoSuite, a leading search-based tool, and GPT-4o, one of the widely used large language models (LLMs). To measure their effectiveness at validating PR-specific changes, we assess their ability to generate fail-to-pass (F2P) test cases, meaning tests that fail on the code before the change and pass on the code after the change. Our evaluation shows that EvoSuite outperformed GPT-4o, producing at least one F2P test for a significantly higher percentage of PRs (36 percent vs. 13 percent). The performance of GPT-4o was significantly hampered by a high rate of compilation errors (63 percent), whereas only 2 percent of EvoSuite's generated tests failed to run. Despite EvoSuite's relative success, our findings indicate that both tools are largely ineffective for this task, as they failed to generate any meaningful change-capturing tests for the large majority of the PRs (64 percent). Although both generators could not achieve a high F2P ratio in our evaluation, and EvoSuite outperformed GPT-4o, we believe that agentic code generation methods may have significant potential for this task. Ultimately, our work highlights a critical gap in tooling and calls for the development of high-performance test generators tailored to the incremental nature of modern software development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows current tools struggle with PR-specific F2P tests but the 64% ineffectiveness claim is hard to interpret without knowing the share of non-behavioral changes in the dataset.

read the letter

The main point to take away is that EvoSuite produced at least one fail-to-pass test for 36% of the PRs while GPT-4o managed only 13%, leaving both tools unable to generate any change-capturing test for 64% of cases. This is a straightforward empirical observation that flags a tooling gap for incremental development.

What the paper does well is move the evaluation target from whole-unit coverage to the actual delta introduced by a PR. The F2P metric is a clean way to check whether a generated test validates the change rather than just exercising old code. The direct head-to-head on real PRs, plus the compilation error rates (63% for GPT-4o, 2% for EvoSuite), gives readers usable numbers on current limits.

The soft spot is the one raised in the stress-test note. The abstract supplies no breakdown of PR change types and no statement that every PR alters observable behavior in a way a unit test can capture. If a non-trivial fraction of the corpus consists of documentation edits, config tweaks, or pure refactors, then failure to produce an F2P test is expected rather than evidence of tool weakness. Without that filter or category table, the headline ineffectiveness percentage is difficult to read. Dataset size, selection criteria, and any statistical checks are also absent from the abstract, though the full paper may contain them.

This is the sort of targeted empirical note that belongs in a software engineering venue. People working on test generation, CI pipelines, or LLM code tools would find the gap identification and the relative performance numbers worth seeing. It is not a deep theoretical advance, but the evaluation design is honest and the call for PR-aware generators is reasonable.

I would send it to peer review. The central comparison is worth referee scrutiny once the PR composition details are verified.

Referee Report

2 major / 2 minor

Summary. The paper evaluates two tools—EvoSuite (search-based) and GPT-4o (LLM)—for PR-aware automated test generation. It measures success via the ability to produce fail-to-pass (F2P) tests that fail before a PR change and pass after. On an unspecified corpus of PRs, EvoSuite generates at least one F2P test for 36% of PRs while GPT-4o succeeds for 13%; both fail for 64% of PRs. The work concludes that current tools are largely ineffective for incremental PR changes, notes GPT-4o's high compilation error rate (63%), and suggests agentic methods as a promising direction for future PR-tailored generators.

Significance. If the evaluation design is sound, the concrete success rates provide direct empirical evidence of a tooling gap between whole-class test generation and the incremental changes typical of modern PR-driven development. The side-by-side comparison of a mature search-based tool against a current LLM, together with the F2P metric, offers a reproducible baseline that future work can build upon. The explicit call for PR-specific generators is a useful framing even if the absolute ineffectiveness numbers require clarification.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: The central claim that both tools are 'largely ineffective' because they produce no F2P test for 64% of PRs is load-bearing on the premise that every selected PR modifies executable behavior in a way that an F2P test can capture. No breakdown of change types (behavioral modification vs. documentation, comments, configuration, or pure refactoring) or filtering criteria is reported, so it is impossible to determine what fraction of the 64% failures are expected rather than evidence of tool limitation.
[Abstract] Abstract: The reported percentages (36% vs. 13% success, 64% overall failure, 63% GPT-4o compilation errors) are presented without accompanying dataset size, selection criteria, statistical tests, or baseline comparisons. This makes it difficult to assess whether the observed differences are robust or whether the PR corpus adequately represents the space of incremental changes across projects and languages.

minor comments (2)

[Abstract] The abstract states that 'agentic code generation methods may have significant potential' without any supporting experiment or reference; this forward-looking claim could be moved to the conclusion or qualified as speculation.
[Abstract] Notation for F2P is introduced without an explicit definition or example in the provided abstract text; a short formal definition would improve clarity for readers unfamiliar with the metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of our evaluation design. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract / Evaluation] Abstract and Evaluation section: The central claim that both tools are 'largely ineffective' because they produce no F2P test for 64% of PRs is load-bearing on the premise that every selected PR modifies executable behavior in a way that an F2P test can capture. No breakdown of change types (behavioral modification vs. documentation, comments, configuration, or pure refactoring) or filtering criteria is reported, so it is impossible to determine what fraction of the 64% failures are expected rather than evidence of tool limitation.

Authors: We agree that an explicit breakdown of PR change types is necessary to strengthen the interpretation of the 64% failure rate. The current manuscript selects PRs involving code modifications from Java projects but does not classify them by type. In the revised version, we will add a classification of change types (behavioral modifications, refactoring, documentation, etc.) in the Evaluation section, along with F2P rates stratified by category and a clearer statement of the filtering criteria. This will clarify what portion of failures may be attributable to non-behavioral changes. revision: yes
Referee: [Abstract] Abstract: The reported percentages (36% vs. 13% success, 64% overall failure, 63% GPT-4o compilation errors) are presented without accompanying dataset size, selection criteria, statistical tests, or baseline comparisons. This makes it difficult to assess whether the observed differences are robust or whether the PR corpus adequately represents the space of incremental changes across projects and languages.

Authors: The Evaluation section provides the dataset size, selection criteria (PRs from multiple open-source Java repositories with compilable changes), and project details. To improve accessibility, we will revise the abstract to include the corpus size and a concise description of selection criteria. The study is exploratory and does not include formal statistical tests; we will add a clarifying sentence on this point. The direct comparison of EvoSuite and GPT-4o constitutes the primary baseline. These updates will be incorporated. revision: yes

Circularity Check

0 steps flagged

Empirical evaluation study with no derivations or self-referential reductions

full rationale

The paper is a direct empirical measurement of two tools (EvoSuite and GPT-4o) on an external corpus of PRs using the F2P metric. No equations, fitted parameters, ansatzes, uniqueness theorems, or self-citations appear in the abstract or framing. All reported percentages (36%, 13%, 64%, 63%, 2%) are raw counts from running the tools on the dataset, not reductions of any claimed derivation to its own inputs. The evaluation is therefore self-contained against external benchmarks and exhibits no circularity of the enumerated kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that F2P tests are the right proxy for PR-aware effectiveness and that the chosen PR corpus is representative; no free parameters or invented entities are introduced.

axioms (2)

domain assumption Fail-to-pass tests are a valid and sufficient measure of a generator's ability to validate PR-specific changes.
The abstract defines success exclusively via F2P tests without justifying why other metrics (e.g., mutation score on the diff) would not be needed.
domain assumption The PRs studied are representative of typical incremental changes in open-source projects.
The abstract presents aggregate percentages without describing how the PR corpus was sampled or whether project/language diversity was controlled.

pith-pipeline@v0.9.1-grok · 5882 in / 1458 out tokens · 23743 ms · 2026-06-29T23:18:17.601555+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 17 canonical work pages · 2 internal anchors

[1]

Evosuite: Automatic test suite generation for object-oriented software,

G. Fraser and A. Arcuri, “Evosuite: Automatic test suite generation for object-oriented software,” inProceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE ’11), 2011, pp. 416–419

2011
[2]

Evosuite at the sbst 2017 tool competition,

G. Fraser, J. M. Rojas, J. Campos, and A. Arcuri, “Evosuite at the sbst 2017 tool competition,” in10th International Workshop on Search-Based Software Testing (SBST 2017) at ICSE, 2017, pp. 39–42

2017
[3]

Evosuite at the sbst 2018 tool competition,

G. Fraser, J. M. Rojas, and A. Arcuri, “Evosuite at the sbst 2018 tool competition,” inProceedings of the 11th International Workshop on Search-Based Software Testing (SBST 2018), 2018, pp. 41–47

2018
[4]

Do automatically generated unit tests find real faults? an empirical study of effectiveness and challenges,

S. Shamshiri, R. Just, J. M. Rojas, G. Fraser, P. McMinn, and A. Arcuri, “Do automatically generated unit tests find real faults? an empirical study of effectiveness and challenges,” inProceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2015, pp. 201–211

2015
[5]

Using relative lines of code to guide automated test gen- eration for python,

J. Holmes, I. Ahmed, C. Brindescu, R. Gopinath, H. Zhang, and A. Groce, “Using relative lines of code to guide automated test gen- eration for python,”ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 29, no. 4, pp. 1–38, 2020

2020
[6]

Pull request decisions explained: An empirical overview,

X. Zhang, Y . Yu, G. Gousios, and A. Rastogi, “Pull request decisions explained: An empirical overview,”IEEE Transactions on Software Engineering, vol. 49, no. 2, pp. 849–871, 2023

2023
[7]

Engineering fundamentals playbook,

M. Corporation, “Engineering fundamentals playbook,” https://microsof t.github.io/code-with-engineering-playbook/, 2024, accessed: October 14, 2025

2024
[8]

Commit-aware mutation testing,

W. Ma, T. Laurent, M. Ojdani ´c, T. T. Chekam, A. Ventresque, and M. Pa- padakis, “Commit-aware mutation testing,” in2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2020, pp. 394–405

2020
[9]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “Swe-bench: Can language models resolve real-world github issues?”arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Issue2test: Generating reproducing test cases from issue reports,

N. Nashid, I. Bouzenia, M. Pradel, and A. Mesbah, “Issue2test: Generating reproducing test cases from issue reports,”arXiv preprint arXiv:2503.16320, 2025

work page arXiv 2025
[11]

When automated program repair meets regression testing—an extensive study on two million patches,

Y . Lou, J. Yang, S. Benton, D. Hao, L. Tan, Z. Chen, L. Zhang, and L. Zhang, “When automated program repair meets regression testing—an extensive study on two million patches,”ACM Trans. Softw. Eng. Methodol., vol. 33, no. 7, Sep. 2024. [Online]. Available: https://doi.org/10.1145/3672450

work page doi:10.1145/3672450 2024
[12]

Lwdiff: an llm-assisted differential testing framework for webassembly runtimes,

S. Zhou, J. Wang, H. Ye, H. Zhou, C. Le Goues, and X. Luo, “Lwdiff: an llm-assisted differential testing framework for webassembly runtimes,” in2025 IEEE/ACM 47th International Conference on Software Engi- neering (ICSE). IEEE Computer Society, 2025, pp. 769–769

2025
[13]

Coverage-directed differential testing of jvm implementations,

Y . Chen, T. Su, C. Sun, Z. Su, and J. Zhao, “Coverage-directed differential testing of jvm implementations,” inproceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2016, pp. 85–99

2016
[14]

Regression testing minimization, selection and prioritization: a survey,

S. Yoo and M. Harman, “Regression testing minimization, selection and prioritization: a survey,”Software testing, verification and reliability, vol. 22, no. 2, pp. 67–120, 2012

2012
[15]

T-evos: A large-scale longitudinal study on ci test execution and failure,

A. R. Chen, T.-H. P. Chen, and S. Wang, “T-evos: A large-scale longitudinal study on ci test execution and failure,”IEEE Transactions on Software Engineering, vol. 49, no. 4, pp. 2352–2364, Apr. 2023

2023
[16]

Finerts: Fine-grained retrieval- augmented test selection,

Y . Liu, M. Gligoric, and P. Nie, “Finerts: Fine-grained retrieval- augmented test selection,” inProceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023). IEEE, 2023, pp. 129–141. [Online]. Available: https: //pengyunie.github.io/p/LiuETAL23FineRTS.pdf

2023
[17]

More precise regression test selection via reasoning about semantics-modifying changes,

Y . Liu, J. Zhang, P. Nie, M. Gligoric, and O. Legunsen, “More precise regression test selection via reasoning about semantics-modifying changes,” inProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2023. New York, NY , USA: Association for Computing Machinery, 2023, p. 664–676. [Online]. Available: h...

work page doi:10.1145/3597926.3598086 2023
[18]

T-evos: A large-scale longitudinal study on ci test execution and failure,

A. Trautsch, P. Surendra, A. Gusarova, M. Fischer, S. Apel, A. van Hoorn, and L. Grunske, “T-evos: A large-scale longitudinal study on ci test execution and failure,”IEEE Transactions on Software Engineering, vol. 49, no. 4, pp. 1945–1967, April 2023

1945
[19]

Randoop: feedback-directed random testing for java,

C. Pacheco and M. D. Ernst, “Randoop: feedback-directed random testing for java,” inCompanion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion, 2007, pp. 815–816

2007
[20]

Sbst tool competition 2021,

S. Panichella, A. Gambi, F. Zampetti, and V . Riccio, “Sbst tool competition 2021,” inProceedings of the IEEE/ACM 43rd International Conference on Software Engineering Workshops (SBST@ICSE 2021). IEEE/ACM, 2021, pp. 419–430. [Online]. Available: https://doi.org/10 .1109/ICSE-Companion.2021.00101

work page arXiv 2021
[21]

Empirical comparison between conven- tional and ai-based automated unit test generation tools in java,

M. Gkikopouli and B. Bataa, “Empirical comparison between conven- tional and ai-based automated unit test generation tools in java,” 2023

2023
[22]

Testforge: Feedback-driven, agentic test suite generation,

K. Jain and C. L. Goues, “Testforge: Feedback-driven, agentic test suite generation,”arXiv preprint arXiv:2503.14713, 2025

work page arXiv 2025
[23]

Openhands: An open platform for AI software developers as generalist agents,

X. Wang, B. Li, Y . Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y . Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y . Shao, N. Muennighoff, Y . Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig, “Openhands: An open platform for AI software developers as generalist agents,” inThe Thirteenth International Conference on Le...

2025
[24]

Test wars: A com- parative study of sbst, symbolic execution, and llm-based approaches to unit test generation,

A. Abdullin, P. Derakhshanfar, and A. Panichella, “Test wars: A com- parative study of sbst, symbolic execution, and llm-based approaches to unit test generation,” in2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2025, pp. 221–232

2025
[25]

Multi-SWE-Bench: A Multi-Language Benchmark for Evaluating Large Language Models on Software Maintenance Tasks,

X. Wang, M. Prabhu, D. Narayanan, S. Yao, H. Chen, S. Narayan, M. Zhang, X. Li, and D. Chen, “Multi-SWE-Bench: A Multi-Language Benchmark for Evaluating Large Language Models on Software Maintenance Tasks,”arXiv preprint arXiv:2407.05530, 2024. [Online]. Available: https://arxiv.org/abs/2407.05530

work page arXiv 2024
[26]

Defects4j: A database of existing faults to enable controlled testing studies for java programs,

R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existing faults to enable controlled testing studies for java programs,” inPro- ceedings of the 2014 international symposium on software testing and analysis, 2014, pp. 437–440

2014
[27]

Feedback-directed random test generation,

C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball, “Feedback-directed random test generation,” inProceedings of the 29th International Con- ference on Software Engineering (ICSE ’07). IEEE, 2007, pp. 75–84

2007
[28]

How do automatically generated unit tests influence software main- tenance?

S. Shamshiri, J. M. Rojas, J. P. Galeotti, N. Walkinshaw, and G. Fraser, “How do automatically generated unit tests influence software main- tenance?” in2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 2018, pp. 250–261

2018
[29]

Automatic test case generation: What if test code quality matters?

F. Palomba, A. Panichella, A. Zaidman, R. Oliveto, and A. D. Lucia, “Automatic test case generation: What if test code quality matters?” inProceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA 2016). New York, NY , USA: Association for Computing Machinery, 2016, pp. 130–141. [Online]. Available: https://doi.org/10.1145/29...

work page doi:10.1145/2931037.2931057 2016
[30]

Toward granular search-based automatic unit test case generation,

F. Pecorelli, G. Grano, F. Palomba, and A. De Lucia, “Toward granular search-based automatic unit test case generation,”Empirical Software Engineering, vol. 29, no. 71, 2024. [Online]. Available: https://doi.org/10.1007/s10664-024-10451-x

work page doi:10.1007/s10664-024-10451-x 2024
[31]

An empirical study of unit test gen- eration with large language models,

L. Yang, C. Yang, S. Gao, W. Wang, B. Wang, Q. Zhu, X. Chu, J. Zhou, G. Liang, Q. Wang, and J. Chen, “An empirical study of unit test gen- eration with large language models,”arXiv preprint arXiv:2406.18181, 2024

work page arXiv 2024
[32]

Testspark: Intellij idea’s ultimate test generation companion,

A. Sapozhnikov, M. Olsthoorn, A. Panichella, V . Kovalenko, and P. Derakhshanfar, “Testspark: Intellij idea’s ultimate test generation companion,” inProceedings of the 2024 ACM/IEEE 46th International Conference on Software Engineering: Companion Proceedings (ICSE Companion 2024), 2024, pp. 30–34

2024
[33]

No more manual tests? evaluating and improving chatgpt for unit test generation,

Z. Yuan, Y . Lou, M. Liu, S. Ding, K. Wang, Y . Chen, and X. Peng, “No more manual tests? evaluating and improving chatgpt for unit test generation,”arXiv preprint arXiv:2305.04207, 2023

work page arXiv 2023
[34]

Critical Code Guided Directed Greybox Fuzzing for Commits,

Y . Xiang, X. Zhang, P. Liu, S. Ji, X. Xiao, H. Liang, J. Xu, and W. Wang, “Critical Code Guided Directed Greybox Fuzzing for Commits,”Pro- ceedings of the 33rd USENIX Security Symposium, 2024

2024
[35]

Can LLM Generate Regression Tests for Software Commits?

J. Liu, S. Lee, E. Losiouk, and M. Böhme, “Can LLM Generate Regression Tests for Software Commits?” arXiv preprint arXiv:2501.11086, 2025. [Online]. Available: https://arxiv.org/abs/2501.11086

work page arXiv 2025
[36]

Sikand, V

X. Hu, Z. Liu, X. Xia, Z. Liu, T. Xu, and X. Yang, “Identify and update test cases when production code changes: A transformer- based approach,” inProceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’23. IEEE Press, 2024, p. 1111–1122. [Online]. Available: https://doi.org/10.1109/ ASE56229.2023.00165

work page arXiv 2024
[37]

SWE-smith: Scaling Data for Software Engineering Agents

J. Yang, K. Lieret, C. E. Jimenez, A. Wettig, K. Khandpur, Y . Zhang, B. Hui, O. Press, L. Schmidt, and D. Yang, “Swe-smith: Scaling data for software engineering agents,” 2025. [Online]. Available: https://arxiv.org/abs/2504.21798

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Swe-bench-java: A github issue resolving benchmark for java,

D. Zan, Z. Huang, A. Yu, S. Lin, Y . Shi, W. Liu, D. Chen, Z. Qi, H. Yu, L. Yuet al., “Swe-bench-java: A github issue resolving benchmark for java,”arXiv preprint arXiv:2408.14354, 2024

work page arXiv 2024
[39]

(2025) Welcome to The Apache Software Foundation

The Apache Software Foundation. (2025) Welcome to The Apache Software Foundation. Accessed 27 August 2025. [Online]. Available: https://www.apache.org/

2025
[40]

Apache maven,

——, “Apache maven,” https://maven.apache.org/, 2025, version 3.9.9, accessed 2025-10-14

2025
[41]

(2021) Differential / regression test generation (evosuiter)

EvoSuite Project. (2021) Differential / regression test generation (evosuiter). Tutorial page. [Online]. Available: https://www.evosuite.org /evosuiter/

2021
[42]

Evosuite 1.1.0 does not recongnize -regressionsuite parameter

leofernandesmo, “Evosuite 1.1.0 does not recongnize -regressionsuite parameter.” https://github.com/EvoSuite/evosuite/issues/353, 4 2021, issue #353 inEvoSuite/evosuiteon GitHub

2021
[43]

Gpt-4o system card,

OpenAI, “Gpt-4o system card,” 8 2024, model system card. [Online]. Available: https://cdn.openai.com/gpt-4o-system-card.pdf

2024
[44]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS 2022), ser. NeurIPS ’22. Red Hook, NY , USA: Curran Associates Inc., 2022. [Online]. Av...

2022
[45]

(2025) Maven surefire plugin

Apache Maven Project. (2025) Maven surefire plugin. Official documentation. [Online]. Available: https://maven.apache.org/surefire/ maven-surefire-plugin/

2025
[46]

Com- bining multiple coverage criteria in search-based unit test generation,

J. M. Rojas, J. Campos, M. Vivanti, G. Fraser, and A. Arcuri, “Com- bining multiple coverage criteria in search-based unit test generation,” inInternational Symposium on Search Based Software Engineering. Springer, 2015, pp. 93–108

2015
[47]

Yate: The role of test repair in llm-based unit test generation,

M. Konstantinou, R. Degiovanni, J. M. Zhang, M. Harman, and M. Papadakis, “Yate: The role of test repair in llm-based unit test generation,” 2025. [Online]. Available: https://arxiv.org/abs/2507.18316

work page arXiv 2025
[48]

TestEval: Benchmarking large language models for test case generation,

W. Wang, C. Yang, Z. Wang, Y . Huang, Z. Chu, D. Song, L. Zhang, A. R. Chen, and L. Ma, “TestEval: Benchmarking large language models for test case generation,” inFindings of the Association for Computational Linguistics: NAACL 2025, L. Chiruzzo, A. Ritter, and L. Wang, Eds. Albuquerque, New Mexico: Association for Computational Linguistics, Apr. 2025, pp...

2025
[49]

Advancing code coverage: Incorporating program analysis with large language models,

C. Yang, J. Chen, B. Lin, Z. Wang, and J. Zhou, “Advancing code coverage: Incorporating program analysis with large language models,”ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 1, no. 1, pp. 1–31, Jan. 2025. [Online]. Available: https://doi.org/10.1145/3748505

work page doi:10.1145/3748505 2025
[50]

SWE-Agent: Agentic software engineering with tool use,

Z. Li, R. Bommasani, D. Chen, and P. Liang, “SWE-Agent: Agentic software engineering with tool use,” inProc. 2024 Conf. Empirical Meth- ods in Natural Language Processing (EMNLP 2024), 2024, available at https://github.com/princeton-nlp/SWE-agent

2024
[51]

(2025) Cline: Autonomous coding agent for VS Code

Cline Development Team. (2025) Cline: Autonomous coding agent for VS Code. Accessed: Oct. 12, 2025. [Online]. Available: https://marketplace.visualstudio.com/items?itemName=clinebcl.cline

2025
[52]

(2025) Cursor: The ai code editor

Cursor AI. (2025) Cursor: The ai code editor. Accessed: Oct. 12, 2025. [Online]. Available: https://www.cursor.com

2025
[53]

(2025) Windsurf: Multi-agent coding and large-scale software workflows

Windsurf AI. (2025) Windsurf: Multi-agent coding and large-scale software workflows. Accessed: Oct. 12, 2025. [Online]. Available: https://windsurf.ai

2025

[1] [1]

Evosuite: Automatic test suite generation for object-oriented software,

G. Fraser and A. Arcuri, “Evosuite: Automatic test suite generation for object-oriented software,” inProceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE ’11), 2011, pp. 416–419

2011

[2] [2]

Evosuite at the sbst 2017 tool competition,

G. Fraser, J. M. Rojas, J. Campos, and A. Arcuri, “Evosuite at the sbst 2017 tool competition,” in10th International Workshop on Search-Based Software Testing (SBST 2017) at ICSE, 2017, pp. 39–42

2017

[3] [3]

Evosuite at the sbst 2018 tool competition,

G. Fraser, J. M. Rojas, and A. Arcuri, “Evosuite at the sbst 2018 tool competition,” inProceedings of the 11th International Workshop on Search-Based Software Testing (SBST 2018), 2018, pp. 41–47

2018

[4] [4]

Do automatically generated unit tests find real faults? an empirical study of effectiveness and challenges,

S. Shamshiri, R. Just, J. M. Rojas, G. Fraser, P. McMinn, and A. Arcuri, “Do automatically generated unit tests find real faults? an empirical study of effectiveness and challenges,” inProceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2015, pp. 201–211

2015

[5] [5]

Using relative lines of code to guide automated test gen- eration for python,

J. Holmes, I. Ahmed, C. Brindescu, R. Gopinath, H. Zhang, and A. Groce, “Using relative lines of code to guide automated test gen- eration for python,”ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 29, no. 4, pp. 1–38, 2020

2020

[6] [6]

Pull request decisions explained: An empirical overview,

X. Zhang, Y . Yu, G. Gousios, and A. Rastogi, “Pull request decisions explained: An empirical overview,”IEEE Transactions on Software Engineering, vol. 49, no. 2, pp. 849–871, 2023

2023

[7] [7]

Engineering fundamentals playbook,

M. Corporation, “Engineering fundamentals playbook,” https://microsof t.github.io/code-with-engineering-playbook/, 2024, accessed: October 14, 2025

2024

[8] [8]

Commit-aware mutation testing,

W. Ma, T. Laurent, M. Ojdani ´c, T. T. Chekam, A. Ventresque, and M. Pa- padakis, “Commit-aware mutation testing,” in2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2020, pp. 394–405

2020

[9] [9]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “Swe-bench: Can language models resolve real-world github issues?”arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Issue2test: Generating reproducing test cases from issue reports,

N. Nashid, I. Bouzenia, M. Pradel, and A. Mesbah, “Issue2test: Generating reproducing test cases from issue reports,”arXiv preprint arXiv:2503.16320, 2025

work page arXiv 2025

[11] [11]

When automated program repair meets regression testing—an extensive study on two million patches,

Y . Lou, J. Yang, S. Benton, D. Hao, L. Tan, Z. Chen, L. Zhang, and L. Zhang, “When automated program repair meets regression testing—an extensive study on two million patches,”ACM Trans. Softw. Eng. Methodol., vol. 33, no. 7, Sep. 2024. [Online]. Available: https://doi.org/10.1145/3672450

work page doi:10.1145/3672450 2024

[12] [12]

Lwdiff: an llm-assisted differential testing framework for webassembly runtimes,

S. Zhou, J. Wang, H. Ye, H. Zhou, C. Le Goues, and X. Luo, “Lwdiff: an llm-assisted differential testing framework for webassembly runtimes,” in2025 IEEE/ACM 47th International Conference on Software Engi- neering (ICSE). IEEE Computer Society, 2025, pp. 769–769

2025

[13] [13]

Coverage-directed differential testing of jvm implementations,

Y . Chen, T. Su, C. Sun, Z. Su, and J. Zhao, “Coverage-directed differential testing of jvm implementations,” inproceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2016, pp. 85–99

2016

[14] [14]

Regression testing minimization, selection and prioritization: a survey,

S. Yoo and M. Harman, “Regression testing minimization, selection and prioritization: a survey,”Software testing, verification and reliability, vol. 22, no. 2, pp. 67–120, 2012

2012

[15] [15]

T-evos: A large-scale longitudinal study on ci test execution and failure,

A. R. Chen, T.-H. P. Chen, and S. Wang, “T-evos: A large-scale longitudinal study on ci test execution and failure,”IEEE Transactions on Software Engineering, vol. 49, no. 4, pp. 2352–2364, Apr. 2023

2023

[16] [16]

Finerts: Fine-grained retrieval- augmented test selection,

Y . Liu, M. Gligoric, and P. Nie, “Finerts: Fine-grained retrieval- augmented test selection,” inProceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023). IEEE, 2023, pp. 129–141. [Online]. Available: https: //pengyunie.github.io/p/LiuETAL23FineRTS.pdf

2023

[17] [17]

More precise regression test selection via reasoning about semantics-modifying changes,

Y . Liu, J. Zhang, P. Nie, M. Gligoric, and O. Legunsen, “More precise regression test selection via reasoning about semantics-modifying changes,” inProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2023. New York, NY , USA: Association for Computing Machinery, 2023, p. 664–676. [Online]. Available: h...

work page doi:10.1145/3597926.3598086 2023

[18] [18]

T-evos: A large-scale longitudinal study on ci test execution and failure,

A. Trautsch, P. Surendra, A. Gusarova, M. Fischer, S. Apel, A. van Hoorn, and L. Grunske, “T-evos: A large-scale longitudinal study on ci test execution and failure,”IEEE Transactions on Software Engineering, vol. 49, no. 4, pp. 1945–1967, April 2023

1945

[19] [19]

Randoop: feedback-directed random testing for java,

C. Pacheco and M. D. Ernst, “Randoop: feedback-directed random testing for java,” inCompanion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion, 2007, pp. 815–816

2007

[20] [20]

Sbst tool competition 2021,

S. Panichella, A. Gambi, F. Zampetti, and V . Riccio, “Sbst tool competition 2021,” inProceedings of the IEEE/ACM 43rd International Conference on Software Engineering Workshops (SBST@ICSE 2021). IEEE/ACM, 2021, pp. 419–430. [Online]. Available: https://doi.org/10 .1109/ICSE-Companion.2021.00101

work page arXiv 2021

[21] [21]

Empirical comparison between conven- tional and ai-based automated unit test generation tools in java,

M. Gkikopouli and B. Bataa, “Empirical comparison between conven- tional and ai-based automated unit test generation tools in java,” 2023

2023

[22] [22]

Testforge: Feedback-driven, agentic test suite generation,

K. Jain and C. L. Goues, “Testforge: Feedback-driven, agentic test suite generation,”arXiv preprint arXiv:2503.14713, 2025

work page arXiv 2025

[23] [23]

Openhands: An open platform for AI software developers as generalist agents,

X. Wang, B. Li, Y . Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y . Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y . Shao, N. Muennighoff, Y . Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig, “Openhands: An open platform for AI software developers as generalist agents,” inThe Thirteenth International Conference on Le...

2025

[24] [24]

Test wars: A com- parative study of sbst, symbolic execution, and llm-based approaches to unit test generation,

A. Abdullin, P. Derakhshanfar, and A. Panichella, “Test wars: A com- parative study of sbst, symbolic execution, and llm-based approaches to unit test generation,” in2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2025, pp. 221–232

2025

[25] [25]

Multi-SWE-Bench: A Multi-Language Benchmark for Evaluating Large Language Models on Software Maintenance Tasks,

X. Wang, M. Prabhu, D. Narayanan, S. Yao, H. Chen, S. Narayan, M. Zhang, X. Li, and D. Chen, “Multi-SWE-Bench: A Multi-Language Benchmark for Evaluating Large Language Models on Software Maintenance Tasks,”arXiv preprint arXiv:2407.05530, 2024. [Online]. Available: https://arxiv.org/abs/2407.05530

work page arXiv 2024

[26] [26]

Defects4j: A database of existing faults to enable controlled testing studies for java programs,

R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existing faults to enable controlled testing studies for java programs,” inPro- ceedings of the 2014 international symposium on software testing and analysis, 2014, pp. 437–440

2014

[27] [27]

Feedback-directed random test generation,

C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball, “Feedback-directed random test generation,” inProceedings of the 29th International Con- ference on Software Engineering (ICSE ’07). IEEE, 2007, pp. 75–84

2007

[28] [28]

How do automatically generated unit tests influence software main- tenance?

S. Shamshiri, J. M. Rojas, J. P. Galeotti, N. Walkinshaw, and G. Fraser, “How do automatically generated unit tests influence software main- tenance?” in2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 2018, pp. 250–261

2018

[29] [29]

Automatic test case generation: What if test code quality matters?

F. Palomba, A. Panichella, A. Zaidman, R. Oliveto, and A. D. Lucia, “Automatic test case generation: What if test code quality matters?” inProceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA 2016). New York, NY , USA: Association for Computing Machinery, 2016, pp. 130–141. [Online]. Available: https://doi.org/10.1145/29...

work page doi:10.1145/2931037.2931057 2016

[30] [30]

Toward granular search-based automatic unit test case generation,

F. Pecorelli, G. Grano, F. Palomba, and A. De Lucia, “Toward granular search-based automatic unit test case generation,”Empirical Software Engineering, vol. 29, no. 71, 2024. [Online]. Available: https://doi.org/10.1007/s10664-024-10451-x

work page doi:10.1007/s10664-024-10451-x 2024

[31] [31]

An empirical study of unit test gen- eration with large language models,

L. Yang, C. Yang, S. Gao, W. Wang, B. Wang, Q. Zhu, X. Chu, J. Zhou, G. Liang, Q. Wang, and J. Chen, “An empirical study of unit test gen- eration with large language models,”arXiv preprint arXiv:2406.18181, 2024

work page arXiv 2024

[32] [32]

Testspark: Intellij idea’s ultimate test generation companion,

A. Sapozhnikov, M. Olsthoorn, A. Panichella, V . Kovalenko, and P. Derakhshanfar, “Testspark: Intellij idea’s ultimate test generation companion,” inProceedings of the 2024 ACM/IEEE 46th International Conference on Software Engineering: Companion Proceedings (ICSE Companion 2024), 2024, pp. 30–34

2024

[33] [33]

No more manual tests? evaluating and improving chatgpt for unit test generation,

Z. Yuan, Y . Lou, M. Liu, S. Ding, K. Wang, Y . Chen, and X. Peng, “No more manual tests? evaluating and improving chatgpt for unit test generation,”arXiv preprint arXiv:2305.04207, 2023

work page arXiv 2023

[34] [34]

Critical Code Guided Directed Greybox Fuzzing for Commits,

Y . Xiang, X. Zhang, P. Liu, S. Ji, X. Xiao, H. Liang, J. Xu, and W. Wang, “Critical Code Guided Directed Greybox Fuzzing for Commits,”Pro- ceedings of the 33rd USENIX Security Symposium, 2024

2024

[35] [35]

Can LLM Generate Regression Tests for Software Commits?

J. Liu, S. Lee, E. Losiouk, and M. Böhme, “Can LLM Generate Regression Tests for Software Commits?” arXiv preprint arXiv:2501.11086, 2025. [Online]. Available: https://arxiv.org/abs/2501.11086

work page arXiv 2025

[36] [36]

Sikand, V

X. Hu, Z. Liu, X. Xia, Z. Liu, T. Xu, and X. Yang, “Identify and update test cases when production code changes: A transformer- based approach,” inProceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’23. IEEE Press, 2024, p. 1111–1122. [Online]. Available: https://doi.org/10.1109/ ASE56229.2023.00165

work page arXiv 2024

[37] [37]

SWE-smith: Scaling Data for Software Engineering Agents

J. Yang, K. Lieret, C. E. Jimenez, A. Wettig, K. Khandpur, Y . Zhang, B. Hui, O. Press, L. Schmidt, and D. Yang, “Swe-smith: Scaling data for software engineering agents,” 2025. [Online]. Available: https://arxiv.org/abs/2504.21798

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Swe-bench-java: A github issue resolving benchmark for java,

D. Zan, Z. Huang, A. Yu, S. Lin, Y . Shi, W. Liu, D. Chen, Z. Qi, H. Yu, L. Yuet al., “Swe-bench-java: A github issue resolving benchmark for java,”arXiv preprint arXiv:2408.14354, 2024

work page arXiv 2024

[39] [39]

(2025) Welcome to The Apache Software Foundation

The Apache Software Foundation. (2025) Welcome to The Apache Software Foundation. Accessed 27 August 2025. [Online]. Available: https://www.apache.org/

2025

[40] [40]

Apache maven,

——, “Apache maven,” https://maven.apache.org/, 2025, version 3.9.9, accessed 2025-10-14

2025

[41] [41]

(2021) Differential / regression test generation (evosuiter)

EvoSuite Project. (2021) Differential / regression test generation (evosuiter). Tutorial page. [Online]. Available: https://www.evosuite.org /evosuiter/

2021

[42] [42]

Evosuite 1.1.0 does not recongnize -regressionsuite parameter

leofernandesmo, “Evosuite 1.1.0 does not recongnize -regressionsuite parameter.” https://github.com/EvoSuite/evosuite/issues/353, 4 2021, issue #353 inEvoSuite/evosuiteon GitHub

2021

[43] [43]

Gpt-4o system card,

OpenAI, “Gpt-4o system card,” 8 2024, model system card. [Online]. Available: https://cdn.openai.com/gpt-4o-system-card.pdf

2024

[44] [44]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS 2022), ser. NeurIPS ’22. Red Hook, NY , USA: Curran Associates Inc., 2022. [Online]. Av...

2022

[45] [45]

(2025) Maven surefire plugin

Apache Maven Project. (2025) Maven surefire plugin. Official documentation. [Online]. Available: https://maven.apache.org/surefire/ maven-surefire-plugin/

2025

[46] [46]

Com- bining multiple coverage criteria in search-based unit test generation,

J. M. Rojas, J. Campos, M. Vivanti, G. Fraser, and A. Arcuri, “Com- bining multiple coverage criteria in search-based unit test generation,” inInternational Symposium on Search Based Software Engineering. Springer, 2015, pp. 93–108

2015

[47] [47]

Yate: The role of test repair in llm-based unit test generation,

M. Konstantinou, R. Degiovanni, J. M. Zhang, M. Harman, and M. Papadakis, “Yate: The role of test repair in llm-based unit test generation,” 2025. [Online]. Available: https://arxiv.org/abs/2507.18316

work page arXiv 2025

[48] [48]

TestEval: Benchmarking large language models for test case generation,

W. Wang, C. Yang, Z. Wang, Y . Huang, Z. Chu, D. Song, L. Zhang, A. R. Chen, and L. Ma, “TestEval: Benchmarking large language models for test case generation,” inFindings of the Association for Computational Linguistics: NAACL 2025, L. Chiruzzo, A. Ritter, and L. Wang, Eds. Albuquerque, New Mexico: Association for Computational Linguistics, Apr. 2025, pp...

2025

[49] [49]

Advancing code coverage: Incorporating program analysis with large language models,

C. Yang, J. Chen, B. Lin, Z. Wang, and J. Zhou, “Advancing code coverage: Incorporating program analysis with large language models,”ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 1, no. 1, pp. 1–31, Jan. 2025. [Online]. Available: https://doi.org/10.1145/3748505

work page doi:10.1145/3748505 2025

[50] [50]

SWE-Agent: Agentic software engineering with tool use,

Z. Li, R. Bommasani, D. Chen, and P. Liang, “SWE-Agent: Agentic software engineering with tool use,” inProc. 2024 Conf. Empirical Meth- ods in Natural Language Processing (EMNLP 2024), 2024, available at https://github.com/princeton-nlp/SWE-agent

2024

[51] [51]

(2025) Cline: Autonomous coding agent for VS Code

Cline Development Team. (2025) Cline: Autonomous coding agent for VS Code. Accessed: Oct. 12, 2025. [Online]. Available: https://marketplace.visualstudio.com/items?itemName=clinebcl.cline

2025

[52] [52]

(2025) Cursor: The ai code editor

Cursor AI. (2025) Cursor: The ai code editor. Accessed: Oct. 12, 2025. [Online]. Available: https://www.cursor.com

2025

[53] [53]

(2025) Windsurf: Multi-agent coding and large-scale software workflows

Windsurf AI. (2025) Windsurf: Multi-agent coding and large-scale software workflows. Accessed: Oct. 12, 2025. [Online]. Available: https://windsurf.ai

2025