Mutation-Guided Unit Test Generation with a Large Language Model

Guancheng Wang; Kui Liu; Lionel Briand; Qinghua Xu

arxiv: 2506.02954 · v8 · submitted 2025-06-03 · 💻 cs.SE

Pith reviewed 2026-05-19 11:00 UTC · model grok-4.3

classification 💻 cs.SE

keywords unit test generation · large language models · mutation testing · fault detection · software testing · LLM prompting · test suite quality · mutation score

The pith

Mutation feedback in prompts lets LLMs generate unit tests that kill more mutants than coverage tools or basic prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that feeding information about still-living mutants back into an LLM prompt produces test suites with stronger fault-detection power than either traditional coverage-maximizing tools or unguided LLM prompting. It shows that high line or branch coverage can coexist with very low mutation scores, so coverage alone is an unreliable signal of real bug-finding ability. MUTGEN therefore treats mutation score as the primary objective and uses an iterative loop that keeps adding feedback on surviving mutants until no further progress occurs. If the approach holds, developers could obtain more effective automated tests without having to craft oracles by hand or rely on weak coverage targets.
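The gap between coverage and mutation score can be made concrete with a small worked example. The sketch below is written in Python for brevity (the paper's subjects are Java), and every name in it (`clamp`, the four mutants, the two suites) is a hypothetical illustration rather than material from the paper. Both suites execute every line and branch of `clamp`, yet only the one with exact-value assertions kills any mutant.

```python
from typing import Callable, List

Clamp = Callable[[int, int, int], int]


def clamp(x: int, lo: int, hi: int) -> int:
    """Code under test: restrict x to the closed interval [lo, hi]."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


# Four hand-written mutants, each differing from clamp in exactly one place.
# A real mutation tool would derive these automatically.
def mutant_low_returns_hi(x: int, lo: int, hi: int) -> int:
    if x < lo:
        return hi  # mutated: was `return lo`
    if x > hi:
        return hi
    return x


def mutant_flipped_upper_check(x: int, lo: int, hi: int) -> int:
    if x < lo:
        return lo
    if x < hi:  # mutated: was `x > hi`
        return hi
    return x


def mutant_high_returns_lo(x: int, lo: int, hi: int) -> int:
    if x < lo:
        return lo
    if x > hi:
        return lo  # mutated: was `return hi`
    return x


def mutant_default_returns_lo(x: int, lo: int, hi: int) -> int:
    if x < lo:
        return lo
    if x > hi:
        return hi
    return lo  # mutated: was `return x`


MUTANTS: List[Clamp] = [
    mutant_low_returns_hi,
    mutant_flipped_upper_check,
    mutant_high_returns_lo,
    mutant_default_returns_lo,
]


def weak_suite(f: Clamp) -> None:
    # Inputs below, inside, and above the range reach every line and branch,
    # but the assertion only checks the result type.
    for x in (-1, 5, 11):
        assert isinstance(f(x, 0, 10), int)


def strong_suite(f: Clamp) -> None:
    # Same inputs, same coverage, but with exact expected values.
    assert f(-1, 0, 10) == 0
    assert f(5, 0, 10) == 5
    assert f(11, 0, 10) == 10


def mutation_score(suite: Callable[[Clamp], None], mutants: List[Clamp]) -> float:
    """Fraction of mutants on which the suite fails (i.e. that it kills)."""
    killed = 0
    for mutant in mutants:
        try:
            suite(mutant)
        except AssertionError:
            killed += 1
    return killed / len(mutants)


if __name__ == "__main__":
    print(mutation_score(weak_suite, MUTANTS))    # 0.0
    print(mutation_score(strong_suite, MUTANTS))  # 1.0
```

The two suites are indistinguishable by coverage alone, which is the situation the paper reports when a suite reaches 100% coverage with a 4% mutation score.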

Core claim

MUTGEN is a mutation-guided, LLM-based test generation approach that incorporates mutation feedback directly into the prompt. Evaluated on 204 subjects from two benchmarks, MUTGEN significantly outperforms both EvoSuite and vanilla prompt-based strategies in terms of mutation score. Furthermore, MUTGEN introduces an iterative generation mechanism that pushes the limits of LLMs in killing additional mutants. The study also analyzes the reasons for live and uncovered mutants and the impact of different mutation operators on generation effectiveness.

What carries the argument

Mutation-guided prompting is the mechanism that carries the argument: each new prompt includes a list of surviving mutants, so the LLM is steered toward writing tests that kill them.
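As a rough illustration of what inserting surviving mutants into a prompt can look like, here is a minimal Python sketch. The `Mutant` record, the `build_prompt` function, the section headings, and the task wording are all assumptions made for this example; the paper's actual prompt (its Figure 4) is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mutant:
    """One surviving mutant, as a mutation report might describe it (hypothetical shape)."""
    line: int        # line number in the code under test
    operator: str    # illustrative operator label, not a specific tool's name
    original: str    # original source fragment
    mutated: str     # fragment after mutation


def build_prompt(source: str, live_mutants: List[Mutant]) -> str:
    """Assemble a generation prompt; mutation feedback is added only when mutants survive."""
    parts = ["## Code under test", source.strip()]
    if live_mutants:
        parts.append("## Surviving mutants")
        for m in sorted(live_mutants, key=lambda m: m.line):
            parts.append(
                f"- line {m.line} [{m.operator}]: `{m.original}` -> `{m.mutated}`"
            )
        task = (
            "Write unit tests that pass on the code above "
            "and fail on each surviving mutant."
        )
    else:
        task = "Write unit tests for the code above."
    parts.extend(["## Task", task])
    return "\n".join(parts)


if __name__ == "__main__":
    demo = [
        Mutant(12, "conditional boundary", "day < 1", "day <= 1"),
        Mutant(7, "negate conditional", "month == 2", "month != 2"),
    ]
    print(build_prompt("boolean validDate(String date) { /* ... */ }", demo))
```

Sorting by line number is a design choice of this sketch, intended only to keep the feedback stable between iterations.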

If this is right

Test suites produced this way reach higher mutation scores on the same subjects than either EvoSuite or plain LLM prompting.
An iterative loop that re-prompts the model with updated mutant status kills additional mutants beyond a single generation pass.
Some mutants remain live or uncovered even after iteration, revealing limits of current LLM generation.
Different mutation operators influence how effectively the LLM can target them in subsequent prompts.
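The iterative loop described in the points above can be sketched as follows. This is a simplified Python model under stated assumptions: `generate` stands in for an LLM call, `evaluate` stands in for a mutation-analysis run, mutants are plain string ids, and the default of four iterations simply mirrors the "4 Iterations" in the Figure 7 caption. None of these names come from the paper.

```python
from typing import Callable, List, Sequence, Set, Tuple


def mutation_guided_loop(
    mutants: Sequence[str],
    generate: Callable[[Set[str]], List[str]],
    evaluate: Callable[[List[str]], Set[str]],
    max_iterations: int = 4,
) -> Tuple[List[str], List[float]]:
    """Re-prompt with the live mutants until none remain or a pass makes no progress."""
    all_mutants = set(mutants)
    tests: List[str] = []
    live = set(all_mutants)
    history: List[float] = []  # mutation score after each pass
    for _ in range(max_iterations):
        if not live:
            break
        tests.extend(generate(live))              # feedback: only live mutants are shown
        killed = evaluate(tests) & all_mutants
        history.append(len(killed) / len(all_mutants))
        still_live = all_mutants - killed
        if len(still_live) >= len(live):          # no additional mutant killed: stop
            break
        live = still_live
    return tests, history


# Deterministic stand-ins so the sketch runs without an LLM or a mutation tool.
PREFIX = "test_kills_"


def fake_generate(live: Set[str]) -> List[str]:
    # "Writes" one test aimed at the first reachable live mutant;
    # mutants whose id starts with "hard" are never targeted.
    targets = sorted(m for m in live if not m.startswith("hard"))
    return [PREFIX + targets[0]] if targets else []


def fake_evaluate(tests: List[str]) -> Set[str]:
    # Each fake test kills exactly the mutant named in its identifier.
    return {t[len(PREFIX):] for t in tests}


if __name__ == "__main__":
    tests, history = mutation_guided_loop(
        ["a", "b", "hard_c"], fake_generate, fake_evaluate
    )
    print(tests)    # ['test_kills_a', 'test_kills_b']
    print(history)  # the score stalls at 2/3 on the third pass, which ends the loop
```

In this toy run the mutant `hard_c` stays live, echoing the observation that some mutants survive even after iteration.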

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feedback pattern could be applied to other structural signals, such as data-flow or exception paths, to steer LLM tests further.
In larger codebases, the cost of iterating might be offset by lower manual test maintenance if the resulting suites need less human repair.
The observed gap between coverage and mutation score suggests that future benchmarks should report mutation score as a primary metric rather than coverage alone.

Load-bearing premise

Mutation score is a reliable and stringent proxy for a test suite's real-world fault-detection capability.

What would settle it

A head-to-head evaluation on a set of real injected or historical bugs where the mutation-guided tests do not detect more faults than the EvoSuite or vanilla-LLM baselines.

Figures

Figures reproduced from arXiv: 2506.02954 by Guancheng Wang, Kui Liu, Lionel Briand, Qinghua Xu.

**Figure 1.** Example code under test. When combined with the example code (including comments), LLMs can generate test cases that achieve high line and branch coverage. However, as demonstrated in prior work [9], [18], [28], high coverage does not necessarily imply strong fault detection capability, for example, when measured as mutation score. For instance, in our experiments, LLMs generate tests for the subject id_81…

**Figure 3.** Example mutation report. B. Generation Stage: To achieve a high mutation score, MUTGEN employs a prompt augmented with mutation feedback, collected during the preprocessing stage, to guide Llama-3.3 in generating test cases, as illustrated below. For this HumanEval-Java subject, as shown in …

**Figure 2.** Prompt used for summarizing. In this example, Llama-3.3 returns the following summary: "This Java function, `validDate`, validates a given date string in the format "mm-dd-yyyy" and returns `true` if the date is valid according to specific rules regarding month and day ranges, and `false` otherwise. The input date string must be non-empty and follow the exact specified format. If any of these conditions are…

**Figure 4.** Prompt used for generation. … the beginning of this section, the LLM can only produce a test suite achieving a 53% mutation score, which remains unchanged even after four iterations. Prompt Used for Fixing: "## Failed Test. The following test failed in the test suite: {{ failed_test with error message }} You are processing execution failures. Please match only one of these guidelines and then try to correct the …

**Figure 5.** Prompt used for fixing. III. OUR APPROACH: In this section, we precisely describe and formalize our proposed approach, MUTGEN. As shown in …

**Figure 6.** MUTGEN Overview. Our assumption, as reported in the mutation testing literature [11], [18], is that by maximizing the mutation score, we also maximize fault detection effectiveness. In any case, achieving mutation coverage is known to be a more stringent criterion than line or branch coverage [28] and therefore enables the generation of more effective test suites. However, as revealed in recent work [32], …

**Figure 7.** Mutation Score Changes over 4 Iterations: Ablation Study on Both …

Original abstract

Unit tests play a vital role in uncovering potential faults in software. While tools like EvoSuite focus on maximizing code coverage, recent advances in large language models (LLMs) have shifted attention toward LLM-based test generation. However, code coverage metrics -- such as line and branch coverage -- remain overly emphasized in reported research, despite being weak indicators of a test suite's fault-detection capability. In contrast, mutation score offers a more reliable and stringent measure, as demonstrated in our findings where some test suites achieve 100% coverage but only 4% mutation score. Although a few studies consider mutation score, the effectiveness of LLMs in killing mutants remains underexplored. In this paper, we propose MUTGEN, a mutation-guided, LLM-based test generation approach that incorporates mutation feedback directly into the prompt. Evaluated on 204 subjects from two benchmarks, MUTGEN significantly outperforms both EvoSuite and vanilla prompt-based strategies in terms of mutation score. Furthermore, MUTGEN introduces an iterative generation mechanism that pushes the limits of LLMs in killing additional mutants. Our study also provide insights into the limitations of LLM-based generation, analyzing the reasons for live and uncovered mutants, and the impact of different mutation operators on generation effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MUTGEN, a mutation-guided LLM-based unit test generation approach that incorporates mutation feedback directly into prompts and uses an iterative generation mechanism to kill additional mutants. Evaluated on 204 subjects from two benchmarks, MUTGEN is claimed to significantly outperform both EvoSuite and vanilla prompt-based LLM strategies in mutation score. The work also analyzes limitations including reasons for live and uncovered mutants and the effects of different mutation operators.

Significance. If the results hold after addressing attribution concerns, the work could meaningfully advance LLM-based test generation by showing how mutation feedback can improve fault-detection proxies beyond coverage metrics or unguided prompting. The scale of the evaluation (204 subjects across two benchmarks) and the inclusion of an iterative mechanism provide practical value, along with insights into LLM limitations. The emphasis on mutation score as a more stringent metric than coverage is a strength, though its translation to real-world bugs remains an interpretive point.

major comments (2)

[Evaluation] Evaluation section (comparison with vanilla baseline): The vanilla prompt-based strategy appears implemented as a single non-iterative prompt, while MUTGEN performs multiple rounds with mutation feedback. Without an explicit ablation that holds the number of LLM calls or iterations fixed, the measured gains in mutation score cannot be securely attributed to the mutation-guidance component rather than the benefits of additional iterations. This directly affects the central claim that mutation feedback drives the outperformance.
[Abstract] Abstract and motivation: The claim that mutation score is a more reliable proxy is illustrated by the 100% coverage / 4% mutation score example, but the paper does not provide evidence or discussion on how well mutation score correlates with actual fault detection for LLM-generated tests in this setting. This assumption underpins the choice to prioritize mutation score over coverage.

minor comments (2)

[Abstract] Abstract: 'Our study also provide insights' should be corrected to 'provides' for subject-verb agreement.
[Throughout] Notation and terminology: Ensure consistent capitalization and phrasing for 'mutation score', 'mutation feedback', and 'live mutants' across sections to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. The comments highlight important aspects of our evaluation design and motivation that we will address to strengthen the paper. Below we respond point by point to the major comments.

Referee: [Evaluation] Evaluation section (comparison with vanilla baseline): The vanilla prompt-based strategy appears implemented as a single non-iterative prompt, while MUTGEN performs multiple rounds with mutation feedback. Without an explicit ablation that holds the number of LLM calls or iterations fixed, the measured gains in mutation score cannot be securely attributed to the mutation-guidance component rather than the benefits of additional iterations. This directly affects the central claim that mutation feedback drives the outperformance.

Authors: We agree that the current comparison leaves room for confounding between the iterative process and the mutation feedback mechanism. The vanilla baseline was implemented as a single non-iterative prompt to represent standard LLM prompting practice, while the iterative refinement in MUTGEN is enabled by feeding back live mutant information. To isolate the contribution of mutation guidance, we will add an ablation study in the revised manuscript that applies an iterative vanilla prompting strategy using the same number of LLM calls and iterations as MUTGEN, allowing direct comparison of mutation scores. revision: yes
Referee: [Abstract] Abstract and motivation: The claim that mutation score is a more reliable proxy is illustrated by the 100% coverage / 4% mutation score example, but the paper does not provide evidence or discussion on how well mutation score correlates with actual fault detection for LLM-generated tests in this setting. This assumption underpins the choice to prioritize mutation score over coverage.

Authors: The 100% coverage / 4% mutation score example is taken directly from test suites generated in our experiments on the benchmark subjects. We reference prior studies in the literature that have shown mutation score to be a stronger indicator of fault detection than coverage in traditional testing contexts. We acknowledge that the manuscript would benefit from more explicit discussion of this correlation specifically for LLM-generated tests. In the revision we will expand the introduction and motivation sections with additional references to studies on mutation score versus real faults and will note the current lack of direct empirical correlation data for LLM-based test generation as a limitation. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with external metrics

The paper proposes MUTGEN as an LLM-based test generation technique that incorporates mutation feedback into prompts and uses an iterative mechanism. All central claims rest on direct experimental comparisons against EvoSuite and vanilla prompt baselines across 204 benchmark subjects, using mutation score as an independent, externally defined proxy for fault detection. No equations, derivations, fitted parameters, or first-principles results appear in the abstract or described method; mutation score is not constructed from the approach itself. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are referenced. The work is self-contained as a benchmark-driven empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard assumptions in software testing research such as the validity of mutation operators as fault proxies and the representativeness of the chosen benchmarks; no new free parameters or invented entities are introduced beyond typical LLM prompting hyperparameters.

axioms (1)

Domain assumption: Mutation score is a more reliable indicator of fault-detection capability than code coverage metrics.
Invoked in the abstract when stating that some test suites achieve 100% coverage but only 4% mutation score.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "MUTGEN ... incorporates mutation feedback directly into the prompt ... iterative generation mechanism that pushes the limits of LLMs in killing additional mutants"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "mutation score ... more reliable and stringent measure ... 100% coverage but only 4% mutation score"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Call-Chain-Aware LLM-Based Test Generation for Java Projects
cs.SE · 2026-04 · unverdicted · novelty 6.0

CAT improves line coverage by 18% and branch coverage by 22% over prior LLM test generation methods by adding call-chain and dependency context from static analysis to prompts.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1] Description on mutation operators, Accessed: 2025.
[2] Jacoco, Accessed: 2025.
[3] Leetcode, Accessed: 2025.
[4] Mutahunter, Accessed: 2025.
[5] Ollama, Accessed: 2025.
[6] Pitest, Accessed: 2025.
[7] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[8] Saranya Alagarsamy, Chakkrit Tantithamthavorn, and Aldeida Aleti. A3test: Assertion-augmented automated test case generation. Information and Software Technology, 176:107565, 2024.
[9] Nadia Alshahwan, Jubin Chheda, Anastasia Finegenova, Beliz Gokkaya, Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, and Eddy Wang. Automated unit test improvement using large language models at meta. arXiv preprint arXiv:2402.09171, 2024.
[10] Saswat Anand, Edmund K Burke, Tsong Yueh Chen, John Clark, Myra B Cohen, Wolfgang Grieskamp, Mark Harman, Mary Jean Harrold, Phil McMinn, Antonia Bertolino, et al. An orchestrated survey of methodologies for automated software test case generation. Journal of systems and software, 86(8):1978–2001, 2013.
[11] James H Andrews, Lionel C Briand, Yvan Labiche, and Akbar Siami Namin. Using mutation analysis for assessing and comparing testing coverage criteria. IEEE Transactions on Software Engineering, 32(8):608–624, 2006.
[12] James H Andrews, Tim Menzies, and Felix CH Li. Genetic algorithms for randomized unit testing. IEEE Transactions on Software Engineering, 37(1):80–94, 2011.
[13] Andrea Arcuri and Lionel Briand. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In Proceedings of the 33rd international conference on software engineering, pages 1–10, 2011.
[14] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. Multi-lingual evaluation of code generation models. arXiv preprint arXiv:2210.14868, 2022.
[15] Soneya Binta Hossain and Matthew Dwyer. Togll: Correct and strong test oracle generation with llms. arXiv e-prints, pages arXiv–2405, 2024.
[16] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[17] Clinton Cao, Annibale Panichella, and Sicco Verwer. Automated test-case generation for rest apis using model inference search heuristic. arXiv preprint arXiv:2412.03420, 2024.
[18] Thierry Titcheu Chekam, Mike Papadakis, Yves Le Traon, and Mark Harman. An empirical study on mutation, statement and branch coverage fault revelation that avoids the unreliable clean program assumption. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pages 597–608. IEEE, 2017.
[19] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[20] Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. Chatunitest: A framework for llm-based test generation. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pages 572–576, 2024.
[21] Xiang Cheng, Fan Sang, Yizhuo Zhai, Xiaokuan Zhang, and Taesoo Kim. Rug: Turbo llm for rust unit test generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 634–634. IEEE Computer Society, 2025.
[22] Fulvio Corno, Ernesto Sánchez, Matteo Sonza Reorda, and Giovanni Squillero. Automatic test program generation: a case study. IEEE Design & Test of Computers, 21(2):102–109, 2004.
[23] Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C Desmarais. Effective test generation using pre-trained large language models and mutation testing. Information and Software Technology, 171:107468, 2024.
[24] Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, and Andy Zaidman. Leveraging large language models for enhancing the understandability of generated unit tests. arXiv preprint arXiv:2408.11710, 2024.
[25] Sida Deng, Rubing Huang, Man Zhang, Chenhui Cui, Dave Towey, and Rongcun Wang. Lrasgen: Llm-based restful api specification generation. arXiv preprint arXiv:2504.16833, 2025.
[26] Elizabeth Dinella, Gabriel Ryan, Todd Mytkowicz, and Shuvendu K Lahiri. Toga: A neural method for test oracle generation. In Proceedings of the 44th International Conference on Software Engineering, pages 2130–2141, 2022.
[27] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pages 31–53. IEEE, 2023.
[28] Christopher Foster, Abhishek Gulati, Mark Harman, Inna Harper, Ke Mao, Jillian Ritchey, Hervé Robert, and Shubho Sengupta. Mutation-guided llm-based test generation at meta. arXiv preprint arXiv:2501.12862, 2025.
[29] Gordon Fraser and Andrea Arcuri. Evosuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pages 416–419, 2011.
[30] Gordon Fraser and Andrea Arcuri. A large-scale evaluation of automated unit test generation using evosuite. ACM Transactions on Software Engineering and Methodology (TOSEM), 24(2):1–42, 2014.
[31] Gordon Fraser and Andrea Arcuri. Achieving scalable mutation-based generation of whole test suites. Empirical Software Engineering, 20(3):783–812, 2015.
[32] Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, and Michael Lyu. The prompt alchemist: Automated llm-tailored prompt optimization for test case generation. arXiv preprint arXiv:2501.01329, 2025.
[33] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[34] Sijia Gu, Noor Nashid, and Ali Mesbah. Llm test generation via iterative hybrid program analysis. arXiv preprint arXiv:2503.13580, 2025.
[35] Siqi Gu, Chunrong Fang, Quanjun Zhang, Fangyuan Tian, Jianyi Zhou, and Zhenyu Chen. Improving llm-based unit test generation via template-based repair. arXiv preprint arXiv:2408.03095, 2024.
[36] Richard Hamlet. Random testing. Encyclopedia of software Engineering, 2:971–978, 1994.
[37] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 33(8):1–79, 2024.
[38] Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. Agentcoder: Multi-agent-based code generation with iterative testing and optimisation. arXiv preprint arXiv:2312.13010, 2023.
[39] Laura Inozemtseva and Reid Holmes. Coverage is not strongly correlated with test suite effectiveness. In Proceedings of the 36th international conference on software engineering, pages 435–445, 2014.
[40] Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. Mapcoder: Multi-agent code generation for competitive problem solving. URL https://arxiv.org/abs/2405.11403
[42] René Just, Darioush Jalali, and Michael D Ernst. Defects4j: A database of existing faults to enable controlled testing studies for java programs. In Proceedings of the 2014 international symposium on software testing and analysis, pages 437–440, 2014.
[43] Shaker Mahmud Khandaker, Fitsum Kifetew, Davide Prandi, and Angelo Susi. Augmentest: Enhancing tests with llm-driven oracles. arXiv preprint arXiv:2501.17461, 2025.
[44] Myeongsoo Kim, Saurabh Sinha, and Alessandro Orso. Llamaresttest: Effective rest api testing with small language models. arXiv preprint arXiv:2501.08598, 2025.
[45] Jia Li, Jiacheng Shen, Yuxin Su, and Michael R Lyu. Llm-assisted mutation for whitebox api testing. arXiv preprint arXiv:2504.05738, 2025.
[46] Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, Shing-Chi Cheung, and Jeff Kramer. Nuances are the key: Unlocking chatgpt to find failure-inducing tests with differential prompting. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 14–26. IEEE, 2023.
[47] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[48] Kaibo Liu, Yiyang Liu, Zhenpeng Chen, Jie M Zhang, Yudong Han, Yun Ma, Ge Li, and Gang Huang. Llm-powered test case generation for detecting tricky bugs. arXiv preprint arXiv:2404.10304, 2024.
[49] Kshirasagar Naik and Priyadarshi Tripathy. Software testing and quality assurance: theory and practice. John Wiley & Sons, 2011.
[50] Zifan Nan, Zhaoqiang Guo, Kui Liu, and Xin Xia. Test intention guided llm-based unit test generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 779–779. IEEE Computer Society, 2025.
[51] Catherine Oriat. Jartege: a tool for random generation of unit tests for java classes. In International Conference on the Quality of Software Architectures, pages 242–256. Springer, 2005.
[52]

Large- scale, independent and comprehensive study of the power of llms for test case generation

Wendk ˆuuni C Ou´edraogo, Kader Kabor´e, Haoye Tian, Yewei Song, Anil Koyuncu, Jacques Klein, David Lo, and Tegawend´e F Bissyand´e. Large- scale, independent and comprehensive study of the power of llms for test case generation. arXiv preprint arXiv:2407.00225 , 2024

work page arXiv 2024
[53]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wain- wright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[54]

Randoop: feedback-directed random testing for java

Carlos Pacheco and Michael D Ernst. Randoop: feedback-directed random testing for java. In Companion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion, pages 815–816, 2007

work page 2007
[55]

Feedback-directed random test generation

Carlos Pacheco, Shuvendu K Lahiri, Michael D Ernst, and Thomas Ball. Feedback-directed random test generation. In 29th International Conference on Software Engineering (ICSE’07) , pages 75–84. IEEE, 2007

work page 2007
[57]

Aster: Natural and multi-language unit test generation with llms

Rangeet Pan, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, and Saurabh Sinha. Aster: Natural and multi-language unit test generation with llms. arXiv preprint arXiv:2409.03093 , 2025

[58]

Coverup: Coverage-guided llm-based test generation

Juan Altmayer Pizzorno and Emery D Berger. Coverup: Coverage-guided llm-based test generation. arXiv preprint arXiv:2403.16218, 2024

[59]

Combining multiple coverage criteria in search-based unit test generation

José Miguel Rojas, José Campos, Mattia Vivanti, Gordon Fraser, and Andrea Arcuri. Combining multiple coverage criteria in search-based unit test generation. In Search-Based Software Engineering: 7th International Symposium, SSBSE 2015, Bergamo, Italy, September 5-7, 2015, Proceedings 7, pages 93–108. Springer, 2015

[60]

Seeding strategies in search-based unit test generation

José Miguel Rojas, Gordon Fraser, and Andrea Arcuri. Seeding strategies in search-based unit test generation. Software Testing, Verification and Reliability, 26(5):366–401, 2016

[61]

Code-aware prompting: A study of coverage-guided test generation in regression setting using llm

Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, and Baishakhi Ray. Code-aware prompting: A study of coverage-guided test generation in regression setting using llm. Proceedings of the ACM on Software Engineering, 1(FSE):951–971, 2024

[62]

Using large language models to generate junit tests: An empirical study

Mohammed Latif Siddiq, Joanna Cecilia Da Silva Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinícius Carvalho Lopes. Using large language models to generate junit tests: An empirical study. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 313–322, 2024

[63]

Epic: Cost-effective search-based prompt engineering of llms for code generation

Hamed Taherkhani, Melika Sepindband, Hung Viet Pham, Song Wang, and Hadi Hemmati. Epic: Cost-effective search-based prompt engineering of llms for code generation. arXiv preprint arXiv:2408.11198, 2024

[64]

Fixing large language models’ specification misunderstanding for better code generation

Zhao Tian, Junjie Chen, and Xiangyu Zhang. Fixing large language models’ specification misunderstanding for better code generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 645–645. IEEE Computer Society, 2025

[65]

A critique and improvement of the cl common language effect size statistics of mcgraw and wong

András Vargha and Harold D Delaney. A critique and improvement of the cl common language effect size statistics of mcgraw and wong. Journal of Educational and Behavioral Statistics, 25(2):101–132, 2000

[66]

Search-based data-flow test generation

Mattia Vivanti, Andre Mis, Alessandra Gorla, and Gordon Fraser. Search-based data-flow test generation. In 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE), pages 370–

[67]

Software testing with large language models: Survey, landscape, and vision

Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering, 2024

[68]

Fine-grained testing for autonomous driving software: a study on autoware with llm-driven unit testing

Wenhan Wang, Xuan Xie, Yuheng Huang, Renzhi Wang, An Ran Chen, and Lei Ma. Fine-grained testing for autonomous driving software: a study on autoware with llm-driven unit testing. arXiv preprint arXiv:2501.09866, 2025

[69]

MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms

Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S Yu, and Chen Xing. Projecttest: A project-level unit test generation benchmark and impact of error fixing mechanisms. arXiv preprint arXiv:2502.06556, 2025

[70]

Towards understanding the characteristics of code generation errors made by large language models

Zhijie Wang, Zijie Zhou, Yuheng Huang, Da Song, Shengmai Chen, Lei Ma, and Tianyi Zhang. Towards understanding the characteristics of code generation errors made by large language models. Preprint, 2025

[71]

Clover: A test case generation benchmark with coverage, long-context, and verification

Jiacheng Xu, Bo Pang, Jin Qu, Hiroaki Hayashi, Caiming Xiong, and Yingbo Zhou. Clover: A test case generation benchmark with coverage, long-context, and verification. arXiv preprint arXiv:2502.08806, 2025

[72]

On the evaluation of large language models in unit test generation

Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, et al. On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 1607–1619, 2024

[73]

Evaluating and improving chatgpt for unit test generation

Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. Evaluating and improving chatgpt for unit test generation. Proceedings of the ACM on Software Engineering, 1(FSE):1703–1726, 2024

[74]

Testbench: Evaluating class-level test case generation capability of large language models

Quanjun Zhang, Ye Shang, Chunrong Fang, Siqi Gu, Jianyi Zhou, and Zhenyu Chen. Testbench: Evaluating class-level test case generation capability of large language models. arXiv preprint arXiv:2409.17561, 2024

[75]

Citywalk: Enhancing llm-based c++ unit test generation via project-dependency awareness and language-specific knowledge

Yuwei Zhang, Qingyuan Lu, Kai Liu, Wensheng Dou, Jiaxin Zhu, Li Qian, Chunxi Zhang, Zheng Lin, and Jun Wei. Citywalk: Enhancing llm-based c++ unit test generation via project-dependency awareness and language-specific knowledge. arXiv preprint arXiv:2501.16155, 2025

[76]

Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x

Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5673–5684, 2023

[77]

Understanding and characterizing mock assertions in unit tests

Hengcheng Zhu, Valerio Terragni, Lili Wei, Shing-Chi Cheung, Jiarong Wu, and Yepang Liu. Understanding and characterizing mock assertions in unit tests. 2025

