arxiv: 2506.02954 · v8 · submitted 2025-06-03 · 💻 cs.SE

Mutation-Guided Unit Test Generation with a Large Language Model

Pith reviewed 2026-05-19 11:00 UTC · model grok-4.3

classification 💻 cs.SE
keywords unit test generation · large language models · mutation testing · fault detection · software testing · LLM prompting · test suite quality · mutation score

The pith

Mutation feedback in prompts lets LLMs generate unit tests that kill more mutants than coverage tools or basic prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that feeding information about still-living mutants back into an LLM prompt produces test suites with stronger fault-detection power than either traditional coverage-maximizing tools or unguided LLM prompting. It shows that high line or branch coverage can coexist with very low mutation scores, so coverage alone is an unreliable signal of real bug-finding ability. MUTGEN therefore treats mutation score as the primary objective and uses an iterative loop that keeps adding feedback on surviving mutants until no further progress occurs. If the approach holds, developers could obtain more effective automated tests without having to craft oracles by hand or rely on weak coverage targets.
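To make the gap between coverage and mutation score concrete, here is a minimal Python sketch. It is not taken from the paper, whose subjects are Java: the function `is_adult`, the four hand-written mutants, and both tests are hypothetical, and mutation score is computed simply as killed mutants divided by total mutants. Both tests execute the only line of `is_adult`, so both reach full line coverage, yet they kill very different shares of the mutants.

```python
from typing import Callable, List

Impl = Callable[[int], bool]
Test = Callable[[Impl], bool]  # a test returns True when it passes


def is_adult(age: int) -> bool:
    """Hypothetical code under test."""
    return age >= 18


# Hand-written mutants of is_adult (illustrative, not produced by a tool).
MUTANTS: List[Impl] = [
    lambda age: age > 18,    # boundary change
    lambda age: age < 18,    # negated comparison
    lambda age: True,        # constant return value
    lambda age: age >= 17,   # changed constant
]


def weak_test(impl: Impl) -> bool:
    # Executes the single line of is_adult, so line coverage is already 100%.
    return impl(30) is True


def strong_test(impl: Impl) -> bool:
    # Same coverage, but it also probes both sides of the boundary.
    return impl(30) is True and impl(18) is True and impl(17) is False


def mutation_score(tests: List[Test], mutants: List[Impl]) -> float:
    """Fraction of mutants on which at least one test fails."""
    killed = sum(1 for mutant in mutants if not all(t(mutant) for t in tests))
    return killed / len(mutants)


if __name__ == "__main__":
    print(mutation_score([weak_test], MUTANTS))
    print(mutation_score([strong_test], MUTANTS))
```

In this toy setting the weak test kills only the negated-comparison mutant, while the boundary-probing test kills all four, even though a coverage tool would rate the two tests identically.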

Core claim

MUTGEN is a mutation-guided, LLM-based test generation approach that incorporates mutation feedback directly into the prompt. Evaluated on 204 subjects from two benchmarks, MUTGEN significantly outperforms both EvoSuite and vanilla prompt-based strategies in terms of mutation score. Furthermore, MUTGEN introduces an iterative generation mechanism that pushes the limits of LLMs in killing additional mutants. The study also analyzes the reasons for live and uncovered mutants and the impact of different mutation operators on generation effectiveness.

What carries the argument

Mutation-guided prompting: each new prompt includes a list of the mutants that survived the current tests, which steers the LLM toward writing tests that kill them.
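A rough sketch of what inserting surviving mutants into a prompt could look like is given below. The `Mutant` record, the status labels, the operator names, and the prompt wording are all assumptions made for illustration. The paper's actual generation prompt is the one in its Figure 4, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mutant:
    """Hypothetical record for one entry of a mutation report."""
    line: int
    operator: str
    status: str  # assumed labels: "killed", "live", or "uncovered"


def build_prompt(source: str, mutants: List[Mutant]) -> str:
    """Assemble a generation prompt that lists every mutant not yet killed."""
    remaining = [m for m in mutants if m.status != "killed"]
    parts = ["Write JUnit tests for the following Java method.", "", source]
    if remaining:
        parts += ["", "The current tests do not kill these mutants:"]
        parts += [f"- line {m.line}: {m.operator} ({m.status})" for m in remaining]
    return "\n".join(parts)


if __name__ == "__main__":
    report = [
        Mutant(line=3, operator="negated conditional", status="live"),
        Mutant(line=5, operator="replaced return value", status="killed"),
    ]
    print(build_prompt("boolean validDate(String date) { ... }", report))
```

Only live and uncovered mutants reach the prompt. Killed mutants are filtered out so that the model's attention goes to the remaining gaps.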

If this is right

  • Test suites produced this way reach higher mutation scores on the same subjects than either EvoSuite or plain LLM prompting.
  • An iterative loop that re-prompts the model with updated mutant status kills additional mutants beyond a single generation pass.
  • Some mutants remain live or uncovered even after iteration, revealing limits of current LLM generation.
  • Different mutation operators influence how effectively the LLM can target them in subsequent prompts.
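The iterative behaviour described in the points above can be sketched as a loop that stops once a round kills no further mutants. This is an illustration under stated assumptions, not the authors' implementation: `generate_and_run` is a hypothetical stand-in for "prompt the LLM, run the new tests, and report which live mutants they kill", and the default cap of four rounds merely mirrors the four iterations mentioned in the Figure 7 caption.

```python
from typing import Callable, FrozenSet, Set, Tuple

# Hypothetical callback: given the live mutants, return those the new tests kill.
Round = Callable[[FrozenSet[str]], Set[str]]


def iterate_until_stalled(
    all_mutants: Set[str],
    generate_and_run: Round,
    max_iterations: int = 4,
) -> Tuple[Set[str], int]:
    """Re-prompt with the live mutants until a round makes no progress."""
    live = set(all_mutants)
    rounds = 0
    for _ in range(max_iterations):
        if not live:
            break  # every mutant is already killed
        rounds += 1
        newly_killed = generate_and_run(frozenset(live)) & live
        if not newly_killed:
            break  # no progress in this round, so stop iterating
        live -= newly_killed
    return live, rounds


# Toy stand-in for the LLM: kills one "easy" mutant per round, never "m4".
HARD = {"m4"}


def fake_round(live: FrozenSet[str]) -> Set[str]:
    return set(sorted(live - HARD)[:1])


if __name__ == "__main__":
    mutants = {"m1", "m2", "m3", "m4"}
    live, rounds = iterate_until_stalled(mutants, fake_round)
    print(live, rounds, 1 - len(live) / len(mutants))
```

With the toy callback, three rounds each kill one mutant and the fourth makes no progress, leaving `m4` live. That loosely mirrors the observation that some mutants survive iteration.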

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feedback pattern could be applied to other structural signals, such as data-flow or exception paths, to steer LLM tests further.
  • In larger codebases the iterative cost might be traded against manual test maintenance by producing suites that need less human repair.
  • The observed gap between coverage and mutation score suggests that future benchmarks should report mutation score as a primary metric rather than coverage alone.

Load-bearing premise

Mutation score is a reliable and stringent proxy for a test suite's real-world fault-detection capability.

What would settle it

A head-to-head evaluation on a set of real injected or historical bugs where the mutation-guided tests do not detect more faults than the EvoSuite or vanilla-LLM baselines.

Figures

Figures reproduced from arXiv: 2506.02954 by Guancheng Wang, Kui Liu, Lionel Briand, Qinghua Xu.

Figure 1
Example code under test. When combined with the example code (including comments), LLMs can generate test cases that achieve high line and branch coverage. However, as demonstrated in prior work [9], [18], [28], high coverage does not necessarily imply strong fault detection capability, for example, when measured as mutation score. For instance, in our experiments, LLMs generate tests for the subject id_81…
Figure 3
Example mutation report. B. Generation Stage: To achieve a high mutation score, MUTGEN employs a prompt augmented with mutation feedback, collected during the preprocessing stage, to guide Llama-3.3 in generating test cases, as illustrated below. For this HumanEval-Java subject, as shown in…
Figure 2
Prompt used for summarizing. In this example, Llama-3.3 returns the following summary: “This Java function, ‘validDate’, validates a given date string in the format “mm-dd-yyyy” and returns ‘true’ if the date is valid according to specific rules regarding month and day ranges, and ‘false’ otherwise. The input date string must be non-empty and follow the exact specified format. If any of these conditions are…
Figure 4
Prompt used for generation. …the beginning of this section, the LLM can only produce a test suite achieving a 53% mutation score, which remains unchanged even after four iterations. Prompt Used for Fixing: ## Failed Test The following test failed in the test suite: {{ failed_test with error message }} You are processing execution failures. Please match only one of these guidelines and then try to correct the …
Figure 5
Prompt used for fixing. III. OUR APPROACH: In this section, we precisely describe and formalize our proposed approach, MUTGEN. As shown in…
Figure 6
MUTGEN Overview. Our assumption, as reported in the mutation testing literature [11], [18], is that by maximizing the mutation score, we also maximize fault detection effectiveness. In any case, achieving mutation coverage is known to be a more stringent criterion than line or branch coverage [28] and therefore enables the generation of more effective test suites. However, as revealed in recent work [32], …
Figure 7
Mutation Score Changes over 4 Iterations: Ablation Study on Both…
Original abstract

Unit tests play a vital role in uncovering potential faults in software. While tools like EvoSuite focus on maximizing code coverage, recent advances in large language models (LLMs) have shifted attention toward LLM-based test generation. However, code coverage metrics -- such as line and branch coverage -- remain overly emphasized in reported research, despite being weak indicators of a test suite's fault-detection capability. In contrast, mutation score offers a more reliable and stringent measure, as demonstrated in our findings where some test suites achieve 100% coverage but only 4% mutation score. Although a few studies consider mutation score, the effectiveness of LLMs in killing mutants remains underexplored. In this paper, we propose MUTGEN, a mutation-guided, LLM-based test generation approach that incorporates mutation feedback directly into the prompt. Evaluated on 204 subjects from two benchmarks, MUTGEN significantly outperforms both EvoSuite and vanilla prompt-based strategies in terms of mutation score. Furthermore, MUTGEN introduces an iterative generation mechanism that pushes the limits of LLMs in killing additional mutants. Our study also provide insights into the limitations of LLM-based generation, analyzing the reasons for live and uncovered mutants, and the impact of different mutation operators on generation effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MUTGEN, a mutation-guided LLM-based unit test generation approach that incorporates mutation feedback directly into prompts and uses an iterative generation mechanism to kill additional mutants. Evaluated on 204 subjects from two benchmarks, MUTGEN is claimed to significantly outperform both EvoSuite and vanilla prompt-based LLM strategies in mutation score. The work also analyzes limitations including reasons for live and uncovered mutants and the effects of different mutation operators.

Significance. If the results hold after addressing attribution concerns, the work could meaningfully advance LLM-based test generation by showing how mutation feedback can improve fault-detection proxies beyond coverage metrics or unguided prompting. The scale of the evaluation (204 subjects across two benchmarks) and the inclusion of an iterative mechanism provide practical value, along with insights into LLM limitations. The emphasis on mutation score as a more stringent metric than coverage is a strength, though its translation to real-world bugs remains an interpretive point.

major comments (2)
  1. [Evaluation] Evaluation section (comparison with vanilla baseline): The vanilla prompt-based strategy appears implemented as a single non-iterative prompt, while MUTGEN performs multiple rounds with mutation feedback. Without an explicit ablation that holds the number of LLM calls or iterations fixed, the measured gains in mutation score cannot be securely attributed to the mutation-guidance component rather than the benefits of additional iterations. This directly affects the central claim that mutation feedback drives the outperformance.
  2. [Abstract] Abstract and motivation: The claim that mutation score is a more reliable proxy is illustrated by the 100% coverage / 4% mutation score example, but the paper does not provide evidence or discussion on how well mutation score correlates with actual fault detection for LLM-generated tests in this setting. This assumption underpins the choice to prioritize mutation score over coverage.
minor comments (2)
  1. [Abstract] Abstract: 'Our study also provide insights' should be corrected to 'provides' for subject-verb agreement.
  2. [Throughout] Notation and terminology: Ensure consistent capitalization and phrasing for 'mutation score', 'mutation feedback', and 'live mutants' across sections to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. The comments highlight important aspects of our evaluation design and motivation that we will address to strengthen the paper. Below we respond point by point to the major comments.

  1. Referee: [Evaluation] Evaluation section (comparison with vanilla baseline): The vanilla prompt-based strategy appears implemented as a single non-iterative prompt, while MUTGEN performs multiple rounds with mutation feedback. Without an explicit ablation that holds the number of LLM calls or iterations fixed, the measured gains in mutation score cannot be securely attributed to the mutation-guidance component rather than the benefits of additional iterations. This directly affects the central claim that mutation feedback drives the outperformance.

    Authors: We agree that the current comparison leaves room for confounding between the iterative process and the mutation feedback mechanism. The vanilla baseline was implemented as a single non-iterative prompt to represent standard LLM prompting practice, while the iterative refinement in MUTGEN is enabled by feeding back live mutant information. To isolate the contribution of mutation guidance, we will add an ablation study in the revised manuscript that applies an iterative vanilla prompting strategy using the same number of LLM calls and iterations as MUTGEN, allowing direct comparison of mutation scores.
    revision: yes

  2. Referee: [Abstract] Abstract and motivation: The claim that mutation score is a more reliable proxy is illustrated by the 100% coverage / 4% mutation score example, but the paper does not provide evidence or discussion on how well mutation score correlates with actual fault detection for LLM-generated tests in this setting. This assumption underpins the choice to prioritize mutation score over coverage.

    Authors: The 100% coverage / 4% mutation score example is taken directly from test suites generated in our experiments on the benchmark subjects. We reference prior studies in the literature that have shown mutation score to be a stronger indicator of fault detection than coverage in traditional testing contexts. We acknowledge that the manuscript would benefit from more explicit discussion of this correlation specifically for LLM-generated tests. In the revision we will expand the introduction and motivation sections with additional references to studies on mutation score versus real faults and will note the current lack of direct empirical correlation data for LLM-based test generation as a limitation.
    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with external metrics

The paper proposes MUTGEN as an LLM-based test generation technique that incorporates mutation feedback into prompts and uses an iterative mechanism. All central claims rest on direct experimental comparisons against EvoSuite and vanilla prompt baselines across 204 benchmark subjects, using mutation score as an independent, externally defined proxy for fault detection. No equations, derivations, fitted parameters, or first-principles results appear in the abstract or described method; mutation score is not constructed from the approach itself. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are referenced. The work is self-contained as a benchmark-driven empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard assumptions in software testing research such as the validity of mutation operators as fault proxies and the representativeness of the chosen benchmarks; no new free parameters or invented entities are introduced beyond typical LLM prompting hyperparameters.

axioms (1)
  • domain assumption: Mutation score is a more reliable indicator of fault-detection capability than code coverage metrics.
    Invoked in the abstract when stating that some test suites achieve 100% coverage but only 4% mutation score.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Call-Chain-Aware LLM-Based Test Generation for Java Projects

    cs.SE · 2026-04 · unverdicted · novelty 6.0

    CAT improves line coverage by 18% and branch coverage by 22% over prior LLM test generation methods by adding call-chain and dependency context from static analysis to prompts.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Description on mutation operators, Accessed: 2025

  2. [2]

    Jacoco, Accessed: 2025

  3. [3]

    Leetcode, Accessed: 2025

  4. [4]

    Mutahunter, Accessed: 2025

  5. [5]

    Ollama, Accessed: 2025

  6. [6]

    Pitest, Accessed: 2025

  7. [7]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  8. [8]

    A3test: Assertion-augmented automated test case generation

    Saranya Alagarsamy, Chakkrit Tantithamthavorn, and Aldeida Aleti. A3test: Assertion-augmented automated test case generation. Information and Software Technology, 176:107565, 2024

  9. [9]

    Automated unit test improvement using large language models at meta

    Nadia Alshahwan, Jubin Chheda, Anastasia Finegenova, Beliz Gokkaya, Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, and Eddy Wang. Automated unit test improvement using large language models at meta. arXiv preprint arXiv:2402.09171, 2024

  10. [10]

    An orchestrated survey of methodologies for automated software test case generation

    Saswat Anand, Edmund K Burke, Tsong Yueh Chen, John Clark, Myra B Cohen, Wolfgang Grieskamp, Mark Harman, Mary Jean Harrold, Phil McMinn, Antonia Bertolino, et al. An orchestrated survey of methodologies for automated software test case generation. Journal of systems and software, 86(8):1978–2001, 2013

  11. [11]

    Using mutation analysis for assessing and comparing testing coverage criteria

    James H Andrews, Lionel C Briand, Yvan Labiche, and Akbar Siami Namin. Using mutation analysis for assessing and comparing testing coverage criteria. IEEE Transactions on Software Engineering, 32(8):608–624, 2006

  12. [12]

    Genetic algorithms for randomized unit testing

    James H Andrews, Tim Menzies, and Felix CH Li. Genetic algorithms for randomized unit testing. IEEE Transactions on Software Engineering, 37(1):80–94, 2011

  13. [13]

    A practical guide for using statistical tests to assess randomized algorithms in software engineering

    Andrea Arcuri and Lionel Briand. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In Proceedings of the 33rd international conference on software engineering, pages 1–10, 2011

  14. [14]

    Multi-lingual evaluation of code generation models

    Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. Multi-lingual evaluation of code generation models. arXiv preprint arXiv:2210.14868, 2022

  15. [15]

    Togll: Correct and strong test oracle generation with llms

    Soneya Binta Hossain and Matthew Dwyer. Togll: Correct and strong test oracle generation with llms. arXiv e-prints, pages arXiv–2405, 2024

  16. [16]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  17. [17]

    Automated test-case generation for rest apis using model inference search heuristic

    Clinton Cao, Annibale Panichella, and Sicco Verwer. Automated test-case generation for rest apis using model inference search heuristic. arXiv preprint arXiv:2412.03420, 2024

  18. [18]

    An empirical study on mutation, statement and branch coverage fault revelation that avoids the unreliable clean program assumption

    Thierry Titcheu Chekam, Mike Papadakis, Yves Le Traon, and Mark Harman. An empirical study on mutation, statement and branch coverage fault revelation that avoids the unreliable clean program assumption. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pages 597–608. IEEE, 2017

  19. [19]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  20. [20]

    Chatunitest: A framework for llm-based test generation

    Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. Chatunitest: A framework for llm-based test generation. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pages 572–576, 2024

  21. [21]

    Rug: Turbo llm for rust unit test generation

    Xiang Cheng, Fan Sang, Yizhuo Zhai, Xiaokuan Zhang, and Taesoo Kim. Rug: Turbo llm for rust unit test generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 634–634. IEEE Computer Society, 2025

  22. [22]

    Automatic test program generation: a case study

    Fulvio Corno, Ernesto Sánchez, Matteo Sonza Reorda, and Giovanni Squillero. Automatic test program generation: a case study. IEEE Design & Test of Computers, 21(2):102–109, 2004

  23. [23]

    Effective test generation using pre-trained large language models and mutation testing

    Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C Desmarais. Effective test generation using pre-trained large language models and mutation testing. Information and Software Technology, 171:107468, 2024

  24. [24]

    Leveraging large language models for enhancing the understandability of generated unit tests

    Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, and Andy Zaidman. Leveraging large language models for enhancing the understandability of generated unit tests. arXiv preprint arXiv:2408.11710, 2024

  25. [25]

    Lrasgen: Llm-based restful api specification generation

    Sida Deng, Rubing Huang, Man Zhang, Chenhui Cui, Dave Towey, and Rongcun Wang. Lrasgen: Llm-based restful api specification generation. arXiv preprint arXiv:2504.16833, 2025

  26. [26]

    Toga: A neural method for test oracle generation

    Elizabeth Dinella, Gabriel Ryan, Todd Mytkowicz, and Shuvendu K Lahiri. Toga: A neural method for test oracle generation. In Proceedings of the 44th International Conference on Software Engineering, pages 2130–2141, 2022

  27. [27]

    Large language models for software engineering: Survey and open problems

    Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pages 31–53. IEEE, 2023

  28. [28]

    Mutation-guided llm-based test generation at meta

    Christopher Foster, Abhishek Gulati, Mark Harman, Inna Harper, Ke Mao, Jillian Ritchey, Hervé Robert, and Shubho Sengupta. Mutation-guided llm-based test generation at meta. arXiv preprint arXiv:2501.12862, 2025

  29. [29]

    Evosuite: automatic test suite generation for object-oriented software

    Gordon Fraser and Andrea Arcuri. Evosuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pages 416–419, 2011

  30. [30]

    A large-scale evaluation of automated unit test generation using evosuite

    Gordon Fraser and Andrea Arcuri. A large-scale evaluation of automated unit test generation using evosuite. ACM Transactions on Software Engineering and Methodology (TOSEM), 24(2):1–42, 2014

  31. [31]

    Achieving scalable mutation-based generation of whole test suites

    Gordon Fraser and Andrea Arcuri. Achieving scalable mutation-based generation of whole test suites. Empirical Software Engineering, 20(3):783–812, 2015

  32. [32]

    The prompt alchemist: Automated llm-tailored prompt optimization for test case generation

    Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, and Michael Lyu. The prompt alchemist: Automated llm-tailored prompt optimization for test case generation. arXiv preprint arXiv:2501.01329, 2025

  33. [33]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  34. [34]

    Llm test generation via iterative hybrid program analysis

    Sijia Gu, Noor Nashid, and Ali Mesbah. Llm test generation via iterative hybrid program analysis. arXiv preprint arXiv:2503.13580, 2025

  35. [35]

    Improving llm-based unit test generation via template-based repair

    Siqi Gu, Chunrong Fang, Quanjun Zhang, Fangyuan Tian, Jianyi Zhou, and Zhenyu Chen. Improving llm-based unit test generation via template-based repair. arXiv preprint arXiv:2408.03095, 2024

  36. [36]

    Random testing

    Richard Hamlet. Random testing. Encyclopedia of software Engineering, 2:971–978, 1994

  37. [37]

    Large language models for software engineering: A systematic literature review

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 33(8):1–79, 2024

  38. [38]

    AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

    Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. Agentcoder: Multi-agent-based code generation with iterative testing and optimisation. arXiv preprint arXiv:2312.13010, 2023

  39. [39]

    Coverage is not strongly correlated with test suite effectiveness

    Laura Inozemtseva and Reid Holmes. Coverage is not strongly correlated with test suite effectiveness. In Proceedings of the 36th international conference on software engineering, pages 435–445, 2014

  40. [40]

    Mapcoder: Multi-agent code generation for competitive problem solving

    Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. Mapcoder: Multi-agent code generation for competitive problem solving. URL https://arxiv.org/abs/2405.11403

  42. [42]

    Defects4j: A database of existing faults to enable controlled testing studies for java programs

    René Just, Darioush Jalali, and Michael D Ernst. Defects4j: A database of existing faults to enable controlled testing studies for java programs. In Proceedings of the 2014 international symposium on software testing and analysis, pages 437–440, 2014

  43. [43]

    Augmentest: Enhancing tests with llm-driven oracles

    Shaker Mahmud Khandaker, Fitsum Kifetew, Davide Prandi, and Angelo Susi. Augmentest: Enhancing tests with llm-driven oracles. arXiv preprint arXiv:2501.17461, 2025

  44. [44]

    Llamaresttest: Effective rest api testing with small language models

    Myeongsoo Kim, Saurabh Sinha, and Alessandro Orso. Llamaresttest: Effective rest api testing with small language models. arXiv preprint arXiv:2501.08598, 2025

  45. [45]

    Llm-assisted mutation for whitebox api testing

    Jia Li, Jiacheng Shen, Yuxin Su, and Michael R Lyu. Llm-assisted mutation for whitebox api testing. arXiv preprint arXiv:2504.05738, 2025

  46. [46]

    Nuances are the key: Unlocking chatgpt to find failure-inducing tests with differential prompting

    Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, Shing-Chi Cheung, and Jeff Kramer. Nuances are the key: Unlocking chatgpt to find failure-inducing tests with differential prompting. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 14–26. IEEE, 2023

  47. [47]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  48. [48]

    Llm-powered test case generation for detecting tricky bugs

    Kaibo Liu, Yiyang Liu, Zhenpeng Chen, Jie M Zhang, Yudong Han, Yun Ma, Ge Li, and Gang Huang. Llm-powered test case generation for detecting tricky bugs. arXiv preprint arXiv:2404.10304, 2024

  49. [49]

    Software testing and quality assurance: theory and practice

    Kshirasagar Naik and Priyadarshi Tripathy. Software testing and quality assurance: theory and practice. John Wiley & Sons, 2011

  50. [50]

    Test intention guided llm-based unit test generation

    Zifan Nan, Zhaoqiang Guo, Kui Liu, and Xin Xia. Test intention guided llm-based unit test generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 779–779. IEEE Computer Society, 2025

  51. [51]

    Jartege: a tool for random generation of unit tests for java classes

    Catherine Oriat. Jartege: a tool for random generation of unit tests for java classes. In International Conference on the Quality of Software Architectures, pages 242–256. Springer, 2005

  52. [52]

    Large-scale, independent and comprehensive study of the power of llms for test case generation

    Wendkûuni C Ouédraogo, Kader Kaboré, Haoye Tian, Yewei Song, Anil Koyuncu, Jacques Klein, David Lo, and Tegawendé F Bissyandé. Large-scale, independent and comprehensive study of the power of llms for test case generation. arXiv preprint arXiv:2407.00225, 2024

  53. [53]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  54. [54]

    Randoop: feedback-directed random testing for java

    Carlos Pacheco and Michael D Ernst. Randoop: feedback-directed random testing for java. In Companion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion, pages 815–816, 2007

  55. [55]

    Feedback-directed random test generation

    Carlos Pacheco, Shuvendu K Lahiri, Michael D Ernst, and Thomas Ball. Feedback-directed random test generation. In 29th International Conference on Software Engineering (ICSE’07), pages 75–84. IEEE, 2007

  56. [57]

    Aster: Natural and multi-language unit test generation with llms

    Rangeet Pan, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, and Saurabh Sinha. Aster: Natural and multi-language unit test generation with llms. arXiv preprint arXiv:2409.03093, 2025

  57. [58]

    Coverup: Coverage-guided llm-based test generation

    Juan Altmayer Pizzorno and Emery D Berger. Coverup: Coverage-guided llm-based test generation. arXiv preprint arXiv:2403.16218, 2024

  58. [59]

    Combining multiple coverage criteria in search-based unit test generation

    José Miguel Rojas, José Campos, Mattia Vivanti, Gordon Fraser, and Andrea Arcuri. Combining multiple coverage criteria in search-based unit test generation. In Search-Based Software Engineering: 7th International Symposium, SSBSE 2015, Bergamo, Italy, September 5-7, 2015, Proceedings 7, pages 93–108. Springer, 2015

  59. [60]

    Seeding strategies in search-based unit test generation

    José Miguel Rojas, Gordon Fraser, and Andrea Arcuri. Seeding strategies in search-based unit test generation. Software Testing, Verification and Reliability, 26(5):366–401, 2016

  60. [61]

    Code-aware prompting: A study of coverage-guided test generation in regression setting using llm

    Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, and Baishakhi Ray. Code-aware prompting: A study of coverage-guided test generation in regression setting using llm. Proceedings of the ACM on Software Engineering, 1(FSE):951–971, 2024

  61. [62]

    Using large language models to generate junit tests: An empirical study

    Mohammed Latif Siddiq, Joanna Cecilia Da Silva Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinícius Carvalho Lopes. Using large language models to generate junit tests: An empirical study. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 313–322, 2024

  62. [63]

    Epic: Cost-effective search-based prompt engineering of llms for code generation

    Hamed Taherkhani, Melika Sepindband, Hung Viet Pham, Song Wang, and Hadi Hemmati. Epic: Cost-effective search-based prompt engineering of llms for code generation. arXiv preprint arXiv:2408.11198, 2024

  63. [64]

    Fixing large language models’ specification misunderstanding for better code generation

    Zhao Tian, Junjie Chen, and Xiangyu Zhang. Fixing large language models’ specification misunderstanding for better code generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 645–645. IEEE Computer Society, 2025

  64. [65]

    A critique and improvement of the cl common language effect size statistics of mcgraw and wong

    András Vargha and Harold D Delaney. A critique and improvement of the cl common language effect size statistics of mcgraw and wong. Journal of Educational and Behavioral Statistics, 25(2):101–132, 2000

  65. [66]

    Search-based data-flow test generation

    Mattia Vivanti, Andre Mis, Alessandra Gorla, and Gordon Fraser. Search-based data-flow test generation. In 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE) , pages 370–

  66. [67]

    Software testing with large language models: Survey, landscape, and vision

    Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering, 2024

  67. [68]

    Fine-grained testing for autonomous driving software: a study on autoware with llm-driven unit testing

    Wenhan Wang, Xuan Xie, Yuheng Huang, Renzhi Wang, An Ran Chen, and Lei Ma. Fine-grained testing for autonomous driving software: a study on autoware with llm-driven unit testing. arXiv preprint arXiv:2501.09866, 2025

  68. [69]

    MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms

    Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S Yu, and Chen Xing. Projecttest: A project-level unit test generation benchmark and impact of error fixing mechanisms. arXiv preprint arXiv:2502.06556 , 2025

  69. [70]

    Towards understanding the characteristics of code generation errors made by large language models

    Zhijie Wang, Zijie Zhou, Yuheng Huang, Da Song, Shengmai Chen, Lei Ma, and Tianyi Zhang. Towards understanding the characteristics of code generation errors made by large language models. Preprint, 2025

  70. [71]

    Clover: A test case generation benchmark with coverage, long-context, and verification

    Jiacheng Xu, Bo Pang, Jin Qu, Hiroaki Hayashi, Caiming Xiong, and Yingbo Zhou. Clover: A test case generation benchmark with coverage, long-context, and verification. arXiv preprint arXiv:2502.08806, 2025

  71. [72]

    On the evaluation of large language models in unit test generation

    Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, et al. On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 1607–1619, 2024

  72. [73]

    Evaluating and improving chatgpt for unit test generation

    Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. Evaluating and improving chatgpt for unit test generation. Proceedings of the ACM on Software Engineering, 1(FSE):1703–1726, 2024

  73. [74]

    Testbench: Evaluating class-level test case generation capability of large language models

    Quanjun Zhang, Ye Shang, Chunrong Fang, Siqi Gu, Jianyi Zhou, and Zhenyu Chen. Testbench: Evaluating class-level test case generation capability of large language models. arXiv preprint arXiv:2409.17561, 2024

  74. [75]

    Citywalk: Enhancing llm-based c++ unit test generation via project-dependency awareness and language-specific knowledge

    Yuwei Zhang, Qingyuan Lu, Kai Liu, Wensheng Dou, Jiaxin Zhu, Li Qian, Chunxi Zhang, Zheng Lin, and Jun Wei. Citywalk: Enhancing llm-based c++ unit test generation via project-dependency awareness and language-specific knowledge. arXiv preprint arXiv:2501.16155, 2025

  75. [76]

    Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x

    Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5673–5684, 2023

  76. [77]

    Understanding and characterizing mock assertions in unit tests

    Hengcheng Zhu, Valerio Terragni, Lili Wei, Shing-Chi Cheung, Jiarong Wu, and Yepang Liu. Understanding and characterizing mock assertions in unit tests. 2025