Mutation-Guided Unit Test Generation with a Large Language Model
Pith reviewed 2026-05-19 11:00 UTC · model grok-4.3
The pith
Mutation feedback in prompts lets LLMs generate unit tests that kill more mutants than coverage tools or basic prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MUTGEN is a mutation-guided, LLM-based test generation approach that incorporates mutation feedback directly into the prompt. Evaluated on 204 subjects from two benchmarks, MUTGEN significantly outperforms both EvoSuite and vanilla prompt-based strategies in terms of mutation score. Furthermore, MUTGEN introduces an iterative generation mechanism that pushes the limits of LLMs in killing additional mutants. The study also analyzes the reasons for live and uncovered mutants and the impact of different mutation operators on generation effectiveness.
What carries the argument
Mutation-guided prompting: each new prompt includes a list of the mutants that survived the current tests, steering the LLM toward writing tests that kill them.
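A minimal sketch of how such a prompt might be assembled. The `Mutant` fields, the prompt wording, and the `build_prompt` helper are all illustrative assumptions; the page does not describe MUTGEN's actual prompt template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutant:
    """One surviving mutant (hypothetical shape, not MUTGEN's own format)."""
    line: int
    operator: str   # name of the mutation operator that produced it
    original: str   # original code fragment
    mutated: str    # fragment after mutation


def build_prompt(source: str, live_mutants: list[Mutant]) -> str:
    """Assemble a test-generation prompt that lists the surviving mutants."""
    parts = [
        "Write unit tests for the following code.",
        "",
        source.rstrip(),
        "",
    ]
    if live_mutants:
        parts.append(
            "These mutants survive the current tests; "
            "write tests that fail on each of them:"
        )
        for i, m in enumerate(live_mutants, start=1):
            parts.append(
                f"{i}. line {m.line} [{m.operator}]: `{m.original}` -> `{m.mutated}`"
            )
    return "\n".join(parts)


example = build_prompt(
    "def f(a, b):\n    return a < b\n",
    [Mutant(2, "ConditionalBoundary", "a < b", "a <= b")],
)
print(example)
```

With an empty mutant list the function falls back to a plain prompt, which is roughly what a vanilla baseline would send.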
If this is right
- Test suites produced this way reach higher mutation scores on the same subjects than either EvoSuite or plain LLM prompting.
- An iterative loop that re-prompts the model with updated mutant status kills additional mutants beyond a single generation pass.
- Some mutants remain live or uncovered even after iteration, revealing limits of current LLM generation.
- Different mutation operators influence how effectively the LLM can target them in subsequent prompts.
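The iterative loop described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not MUTGEN's implementation: `generate_tests` stands in for the LLM call, `killed_by` stands in for a mutation-analysis run, and the stopping rule (stop when a round kills nothing) and round budget are choices made for this example.

```python
from typing import Callable


def iterate(
    mutants: set[str],
    generate_tests: Callable[[set[str]], list[str]],
    killed_by: Callable[[list[str], set[str]], set[str]],
    max_rounds: int = 5,
) -> tuple[list[str], set[str]]:
    """Re-prompt with the surviving mutants until none remain,
    a round makes no progress, or the round budget is used up."""
    suite: list[str] = []
    live = set(mutants)
    for _ in range(max_rounds):
        if not live:
            break
        new_tests = generate_tests(live)      # hypothetical LLM call
        killed = killed_by(new_tests, live)   # hypothetical mutation run
        if not killed:
            break                             # assumed stopping rule
        suite.extend(new_tests)
        live -= killed
    return suite, live


# Toy stand-ins so the sketch runs without an LLM or a mutation tool.
def fake_generate(live: set[str]) -> list[str]:
    target = sorted(live)[0]
    return [f"test_kills_{target}"]


def fake_killed_by(tests: list[str], live: set[str]) -> set[str]:
    names = {t.removeprefix("test_kills_") for t in tests}
    # "m3" models a mutant the generator never manages to kill.
    return {m for m in live if m in names and m != "m3"}


suite, live = iterate({"m1", "m2", "m3"}, fake_generate, fake_killed_by)
print(suite, live)
```

In the toy run two mutants are killed over two rounds and one stays live, mirroring the observation that some mutants survive even after iteration.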
Where Pith is reading between the lines
- The same feedback pattern could be applied to other structural signals, such as data-flow or exception paths, to steer LLM tests further.
- In larger codebases, the cost of iterating might be offset by reduced manual test maintenance, if the resulting suites need less human repair.
- The observed gap between coverage and mutation score suggests that future benchmarks should report mutation score as a primary metric rather than coverage alone.
Load-bearing premise
Mutation score is a reliable and stringent proxy for a test suite's real-world fault-detection capability.
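A toy illustration of why coverage and mutation score can diverge. Mutation score is computed here as killed mutants divided by all mutants; the function, the hand-written mutants, and both tests are invented for this example and are unrelated to the paper's subjects or to its reported 100% / 4% case.

```python
from typing import Callable

Impl = Callable[[int], bool]


def original(age: int) -> bool:
    return age >= 18


# Hand-written mutants standing in for what a mutation tool would produce.
MUTANTS: list[Impl] = [
    lambda age: age > 18,   # boundary changed
    lambda age: age < 18,   # condition negated
    lambda age: True,       # return value replaced
    lambda age: False,      # return value replaced
]


def weak_test(impl: Impl) -> None:
    impl(30)  # executes every line of the function but asserts nothing


def strong_test(impl: Impl) -> None:
    assert impl(18) is True
    assert impl(17) is False


def mutation_score(test: Callable[[Impl], None], mutants: list[Impl]) -> float:
    """Fraction of mutants on which the test fails (i.e. that it kills)."""
    killed = 0
    for mutant in mutants:
        try:
            test(mutant)
        except AssertionError:
            killed += 1
    return killed / len(mutants)


# Both tests pass on the original and cover the same single line.
weak_test(original)
strong_test(original)
print(mutation_score(weak_test, MUTANTS), mutation_score(strong_test, MUTANTS))
```

Both tests reach full line coverage of `original`, yet only the one with boundary assertions kills any mutant.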
What would settle it
A head-to-head evaluation on a set of real injected or historical bugs where the mutation-guided tests do not detect more faults than the EvoSuite or vanilla-LLM baselines.
Original abstract
Unit tests play a vital role in uncovering potential faults in software. While tools like EvoSuite focus on maximizing code coverage, recent advances in large language models (LLMs) have shifted attention toward LLM-based test generation. However, code coverage metrics -- such as line and branch coverage -- remain overly emphasized in reported research, despite being weak indicators of a test suite's fault-detection capability. In contrast, mutation score offers a more reliable and stringent measure, as demonstrated in our findings where some test suites achieve 100% coverage but only 4% mutation score. Although a few studies consider mutation score, the effectiveness of LLMs in killing mutants remains underexplored. In this paper, we propose MUTGEN, a mutation-guided, LLM-based test generation approach that incorporates mutation feedback directly into the prompt. Evaluated on 204 subjects from two benchmarks, MUTGEN significantly outperforms both EvoSuite and vanilla prompt-based strategies in terms of mutation score. Furthermore, MUTGEN introduces an iterative generation mechanism that pushes the limits of LLMs in killing additional mutants. Our study also provide insights into the limitations of LLM-based generation, analyzing the reasons for live and uncovered mutants, and the impact of different mutation operators on generation effectiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MUTGEN, a mutation-guided LLM-based unit test generation approach that incorporates mutation feedback directly into prompts and uses an iterative generation mechanism to kill additional mutants. Evaluated on 204 subjects from two benchmarks, MUTGEN is claimed to significantly outperform both EvoSuite and vanilla prompt-based LLM strategies in mutation score. The work also analyzes limitations including reasons for live and uncovered mutants and the effects of different mutation operators.
Significance. If the results hold after addressing attribution concerns, the work could meaningfully advance LLM-based test generation by showing how mutation feedback can improve fault-detection proxies beyond coverage metrics or unguided prompting. The scale of the evaluation (204 subjects across two benchmarks) and the inclusion of an iterative mechanism provide practical value, along with insights into LLM limitations. The emphasis on mutation score as a more stringent metric than coverage is a strength, though its translation to real-world bugs remains an interpretive point.
major comments (2)
- [Evaluation] Evaluation section (comparison with vanilla baseline): The vanilla prompt-based strategy appears implemented as a single non-iterative prompt, while MUTGEN performs multiple rounds with mutation feedback. Without an explicit ablation that holds the number of LLM calls or iterations fixed, the measured gains in mutation score cannot be securely attributed to the mutation-guidance component rather than the benefits of additional iterations. This directly affects the central claim that mutation feedback drives the outperformance.
- [Abstract] Abstract and motivation: The claim that mutation score is a more reliable proxy is illustrated by the 100% coverage / 4% mutation score example, but the paper does not provide evidence or discussion on how well mutation score correlates with actual fault detection for LLM-generated tests in this setting. This assumption underpins the choice to prioritize mutation score over coverage.
minor comments (2)
- [Abstract] Abstract: 'Our study also provide insights' should be corrected to 'provides' for subject-verb agreement.
- [Throughout] Notation and terminology: Ensure consistent capitalization and phrasing for 'mutation score', 'mutation feedback', and 'live mutants' across sections to aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our manuscript. The comments highlight important aspects of our evaluation design and motivation that we will address to strengthen the paper. Below we respond point by point to the major comments.
- Referee: [Evaluation] Evaluation section (comparison with vanilla baseline): The vanilla prompt-based strategy appears implemented as a single non-iterative prompt, while MUTGEN performs multiple rounds with mutation feedback. Without an explicit ablation that holds the number of LLM calls or iterations fixed, the measured gains in mutation score cannot be securely attributed to the mutation-guidance component rather than the benefits of additional iterations. This directly affects the central claim that mutation feedback drives the outperformance.
  Authors: We agree that the current comparison leaves room for confounding between the iterative process and the mutation feedback mechanism. The vanilla baseline was implemented as a single non-iterative prompt to represent standard LLM prompting practice, while the iterative refinement in MUTGEN is enabled by feeding back live mutant information. To isolate the contribution of mutation guidance, we will add an ablation study in the revised manuscript that applies an iterative vanilla prompting strategy using the same number of LLM calls and iterations as MUTGEN, allowing direct comparison of mutation scores.
  revision: yes
- Referee: [Abstract] Abstract and motivation: The claim that mutation score is a more reliable proxy is illustrated by the 100% coverage / 4% mutation score example, but the paper does not provide evidence or discussion on how well mutation score correlates with actual fault detection for LLM-generated tests in this setting. This assumption underpins the choice to prioritize mutation score over coverage.
  Authors: The 100% coverage / 4% mutation score example is taken directly from test suites generated in our experiments on the benchmark subjects. We reference prior studies in the literature that have shown mutation score to be a stronger indicator of fault detection than coverage in traditional testing contexts. We acknowledge that the manuscript would benefit from more explicit discussion of this correlation specifically for LLM-generated tests. In the revision we will expand the introduction and motivation sections with additional references to studies on mutation score versus real faults and will note the current lack of direct empirical correlation data for LLM-based test generation as a limitation.
  revision: yes
Circularity Check
No circularity: purely empirical evaluation with external metrics
The paper proposes MUTGEN as an LLM-based test generation technique that incorporates mutation feedback into prompts and uses an iterative mechanism. All central claims rest on direct experimental comparisons against EvoSuite and vanilla prompt baselines across 204 benchmark subjects, using mutation score as an independent, externally defined proxy for fault detection. No equations, derivations, fitted parameters, or first-principles results appear in the abstract or described method; mutation score is not constructed from the approach itself. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are referenced. The work is self-contained as a benchmark-driven empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Mutation score is a more reliable indicator of fault-detection capability than code coverage metrics.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  MUTGEN ... incorporates mutation feedback directly into the prompt ... iterative generation mechanism that pushes the limits of LLMs in killing additional mutants
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  mutation score ... more reliable and stringent measure ... 100% coverage but only 4% mutation score
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Call-Chain-Aware LLM-Based Test Generation for Java Projects
  CAT improves line coverage by 18% and branch coverage by 22% over prior LLM test generation methods by adding call-chain and dependency context from static analysis to prompts.