Improving LLM-Driven Test Generation by Learning from Mocking Information

Flynn Teh; Hengcheng Zhu; Jamie Lee; Mattia Fazzini; Mengzhen Li; Valerio Terragni

arxiv: 2604.19315 · v1 · submitted 2026-04-21 · 💻 cs.SE

Jamie Lee, Flynn Teh, Hengcheng Zhu, Mengzhen Li, Mattia Fazzini, Valerio Terragni

Pith reviewed 2026-05-10 03:01 UTC · model grok-4.3

classification 💻 cs.SE

keywords LLM test generation · mocking · test doubles · unit testing · automated test generation · Java projects · mutant killing

The pith

Leveraging mocking information from existing tests allows LLMs to generate unit tests that cover lines of code and kill mutants that existing project tests and baseline LLM generations miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates whether information from mocks in existing test suites can improve automated test generation by large language models. The authors develop MOCKMILL to extract stubbings and interaction expectations from test doubles and feed them into the LLM prompt, along with an iterative repair process. Evaluation on 10 Java classes shows that the resulting tests cover lines and kill mutants missed by both project tests and baseline LLM generations. A sympathetic reader would care because this suggests a practical way to make LLM test generators more effective by leveraging existing developer artifacts rather than starting from scratch.

Core claim

MOCKMILL extracts mocking information from existing tests, including stubbings and interaction expectations for components replaced by test doubles, and uses this to guide LLM-based test case generation combined with iterative generation-and-repair to produce executable tests. On 10 open-source classes from six Java projects, using four LLMs, MOCKMILL's tests cover lines of code and kill mutants that existing tests and baseline-generated tests miss.
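
The claim is about complementarity: what matters is the set of lines (or mutants) reached only by the new tests. A minimal sketch of that comparison, assuming coverage and kill results are already available as sets of string identifiers. The class name `ComplementSketch`, the method name, and the identifier format are illustrative assumptions, not taken from the paper.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative helper only; not part of MOCKMILL.
class ComplementSketch {
    // Returns the items (covered lines or killed mutants) that the candidate
    // suite reaches and that neither the existing tests nor the baseline reach.
    static Set<String> uniquelyCovered(Set<String> candidate,
                                       Set<String> existing,
                                       Set<String> baseline) {
        Set<String> result = new TreeSet<>(candidate); // copy, sorted for stable output
        result.removeAll(existing);
        result.removeAll(baseline);
        return result;
    }
}
```

With hypothetical inputs, a candidate set of `A:10`, `A:11`, `A:12`, an existing set of `A:10`, and a baseline set of `A:11`, `A:20` leaves only `A:12` as uniquely covered.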

What carries the argument

MOCKMILL, the technique that automatically extracts mocking information (stubbings and interaction expectations) from developer-written tests to guide LLM test generation.
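
To make the extraction step concrete, here is a rough sketch that pulls Mockito-style stubbings (`when(...).thenReturn(...)`) and interaction expectations (`verify(...)`) out of test source text. It is a deliberate simplification: it swaps in regular expressions for real parsing, handles only the simplest call shapes, and every name in it (`MockInfoSketch`, `extractStubbings`, `extractVerifications`) is an assumption for illustration rather than MOCKMILL's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a regex stand-in for proper source analysis.
class MockInfoSketch {
    // Matches simple stubbings such as: when(repo.findById(1L)).thenReturn(user);
    private static final Pattern STUBBING = Pattern.compile(
        "when\\((\\w+)\\.(\\w+)\\(([^)]*)\\)\\)\\.thenReturn\\(([^)]*)\\)");

    // Matches simple interaction expectations such as: verify(repo).save(user);
    private static final Pattern VERIFICATION = Pattern.compile(
        "verify\\((\\w+)\\)\\.(\\w+)\\(([^)]*)\\)");

    // Returns one "mock.method(args) -> value" entry per stubbing found.
    static List<String> extractStubbings(String testSource) {
        List<String> out = new ArrayList<>();
        Matcher m = STUBBING.matcher(testSource);
        while (m.find()) {
            out.add(m.group(1) + "." + m.group(2) + "(" + m.group(3) + ") -> " + m.group(4));
        }
        return out;
    }

    // Returns one "mock.method(args)" entry per verification found.
    static List<String> extractVerifications(String testSource) {
        List<String> out = new ArrayList<>();
        Matcher m = VERIFICATION.matcher(testSource);
        while (m.find()) {
            out.add(m.group(1) + "." + m.group(2) + "(" + m.group(3) + ")");
        }
        return out;
    }
}
```

Nested calls, argument matchers, and chained stubbings would defeat these patterns, which is one reason a parser-based approach is the more robust choice in practice.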

If this is right

LLM test generation can be enhanced by incorporating domain-specific mocking patterns from existing suites.
Tests produced this way complement both manual tests and standard prompting approaches.
Iterative repair helps ensure the generated tests are executable.
This approach is applicable to Java projects with existing test suites containing mocks.
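
The iterative generation-and-repair idea in the list above can be sketched as a bounded retry loop. This is a hedged outline, not the paper's algorithm: the `generate` function stands in for an LLM call, the `check` function stands in for compiling and running the candidate test, and the attempt budget is an invented parameter.

```java
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Function;

// Illustrative only; names and signatures are assumptions.
class RepairLoopSketch {
    /**
     * generate: (prompt, previous error or null) -> candidate test source
     * check:    candidate -> empty if the test compiles and runs, else an error message
     */
    static Optional<String> generateAndRepair(
            String prompt,
            BiFunction<String, String, String> generate,
            Function<String, Optional<String>> check,
            int maxAttempts) {
        String lastError = null; // no feedback on the first attempt
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String candidate = generate.apply(prompt, lastError);
            Optional<String> error = check.apply(candidate);
            if (!error.isPresent()) {
                return Optional.of(candidate); // executable test found
            }
            lastError = error.get(); // feed the failure back into the next attempt
        }
        return Optional.empty(); // budget exhausted without an executable test
    }
}
```

Passing the functions in as parameters keeps the loop testable without a model or a build tool, since both can be replaced by small lambdas.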

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If mocking info is generalizable, similar techniques could apply to other test artifacts like assertions or setup code.
This might reduce the need for extensive prompt engineering in test generation tasks.
On projects without many mocks, the benefit might be limited, suggesting hybrid approaches.

Load-bearing premise

The assumption that mocking information from existing tests provides useful and generalizable guidance for generating new tests beyond what LLMs can infer from code and standard prompts alone.

What would settle it

Running MOCKMILL on additional projects or classes where existing tests have minimal or atypical mocking patterns and observing no improvement in coverage or mutant killing over baselines.

Figures

Figures reproduced from arXiv: 2604.19315 by Flynn Teh, Hengcheng Zhu, Jamie Lee, Mattia Fazzini, Mengzhen Li, Valerio Terragni.

**Figure 1.** A running example illustrating how MOCKMILL leverages mocking information and the advantage of incorporating it. The figure includes three tests: the original developer-written test pageQueryAopLogsTest (from which MOCKMILL extracts the mocking data), the test generated by an LLM-driven baseline without providing mocking information (baseline_test), and the test generated by MOCKMILL using the ex…

**Figure 2.** High-level overview of MOCKMILL’s workflow. A. Project Analysis: The Project Analysis phase identifies the dependent components that can be targets for test generation by analyzing the test code in the project under analysis. Specifically, this phase parses the abstract syntax tree (AST) of the test files in the project to identify the components that have been replaced by test doubles and that use stubbin…

Original abstract

Large Language Models (LLMs) have recently shown strong potential for automated unit test generation. This has motivated us to investigate whether developer-defined test doubles (commonly referred to as mocks) available in existing test suites can be leveraged to improve LLM-driven test generation. To this end, we propose MOCKMILL, an LLM-based technique and tool that generates test cases by exploiting mocking information automatically extracted from developer-written tests. MOCKMILL targets components that are replaced by test doubles in existing tests and uses the encoded stubbings and interaction expectations to guide test generation, combined with an iterative generation-and-repair process to ensure executable tests. We evaluated MOCKMILL on 10 open-source classes from six Java projects using four LLMs, and compared the generated tests with existing project tests and tests produced by baseline approaches. The results show that MOCKMILL's tests cover lines of code and kill mutants that existing tests and baseline-generated tests miss. Overall, our findings provide preliminary evidence that leveraging mocking information is a complementary and effective way to enhance LLM-based test generation.
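
For readers less familiar with the terms in the abstract: a stubbing is a canned answer that a test double returns, and an interaction expectation is a check on which calls the code under test made on the double. The sketch below shows both with a hand-written double and no framework. All names (`UserRepository`, `RecordingUserRepository`, `Greeter`) are invented for illustration; a mocking framework such as Mockito expresses the same ideas with library calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collaborator that a test would replace with a double.
interface UserRepository {
    String findName(long id);
    void save(long id, String name);
}

// Hand-written test double: a canned answer plus a log of received calls.
class RecordingUserRepository implements UserRepository {
    final List<String> calls = new ArrayList<>();
    private final String stubbedName;

    RecordingUserRepository(String stubbedName) {
        this.stubbedName = stubbedName;
    }

    @Override
    public String findName(long id) {
        calls.add("findName(" + id + ")");
        return stubbedName; // stubbing: fixed return value, no real storage
    }

    @Override
    public void save(long id, String name) {
        calls.add("save(" + id + ", " + name + ")"); // recorded for later checks
    }
}

// Hypothetical class under test that depends on the collaborator.
class Greeter {
    private final UserRepository repo;

    Greeter(UserRepository repo) {
        this.repo = repo;
    }

    String greetAndNormalize(long id) {
        String name = repo.findName(id).trim();
        repo.save(id, name);
        return "Hello, " + name;
    }
}
```

A test would construct `Greeter` with the double, call `greetAndNormalize`, assert on the returned string, and then inspect `calls` to confirm the expected interactions happened in order.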

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MOCKMILL extracts mocking patterns from existing tests to guide LLMs, but the gains may stem from the repair loop rather than from the mocks themselves.

The main thing to know is that this paper proposes MOCKMILL, which pulls stubbings and interaction expectations from developer-written tests to steer LLM test generation, combined with an iterative repair process. On 10 Java classes across six projects and four LLMs, the generated tests covered extra lines and killed mutants missed by existing tests and baselines. That is the core claim and the part that feels targeted and practical for real codebases where mocks already exist.

Referee Report

2 major / 2 minor

Summary. The manuscript presents MOCKMILL, an LLM-based technique that extracts mocking information (stubbings and interaction expectations) from existing developer-written tests to guide the generation of unit tests targeting components replaced by test doubles. The approach combines this guidance with an iterative generation-and-repair process to produce executable tests. It is evaluated on 10 open-source Java classes from six projects using four LLMs, with the central claim that the resulting tests achieve higher line coverage and kill more mutants than both the project's existing tests and tests from baseline LLM approaches.

Significance. If the empirical results hold after addressing the evaluation gaps, this would represent a useful contribution to automated test generation by demonstrating how readily available mocking artifacts in existing test suites can provide complementary guidance to LLMs. The multi-LLM, multi-project evaluation setup offers a reasonable empirical foundation for assessing practicality in real-world Java codebases.

major comments (2)

[Evaluation section] The headline results attribute superior line coverage and mutant killing to MOCKMILL's use of extracted mocking information. However, the technique also includes an iterative generation-and-repair process, and the comparison is only against unspecified baseline approaches and existing tests. No ablation is described that applies identical iteration and repair but omits the mocking extraction and stubbing guidance. Without this control, the observed gains cannot be confidently ascribed to the mocking component.
[Results section] The abstract and evaluation provide no details on statistical tests for the reported differences in coverage and mutant killing, the exact definition and implementation of the baseline approaches, or the criteria used to select the 10 classes and 4 LLMs. These omissions leave the central claim only weakly supported.

minor comments (2)

[Abstract] The abstract refers to 'baseline approaches' without naming or briefly describing them; this should be clarified early in the introduction or evaluation to allow readers to understand the comparisons.
[Approach description] Consider adding a brief example in the approach description showing how mocking information is automatically extracted from a sample test to improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the evaluation and results presentation.

Referee: [Evaluation section] The headline results attribute superior line coverage and mutant killing to MOCKMILL's use of extracted mocking information. However, the technique also includes an iterative generation-and-repair process, and the comparison is only against unspecified baseline approaches and existing tests. No ablation is described that applies identical iteration and repair but omits the mocking extraction and stubbing guidance. Without this control, the observed gains cannot be confidently ascribed to the mocking component.

Authors: We agree that the current evaluation does not isolate the contribution of the mocking information from the iterative generation-and-repair process. In the revised manuscript we will add an ablation study that applies the identical iterative generation-and-repair pipeline but disables the mocking extraction and stubbing guidance. This control will allow us to attribute performance differences more precisely to the mocking component. We will also expand the description of the baseline approaches to make the comparisons explicit. Revision: yes.

Referee: [Results section] The abstract and evaluation provide no details on statistical tests for the reported differences in coverage and mutant killing, the exact definition and implementation of the baseline approaches, or the criteria used to select the 10 classes and 4 LLMs. These omissions leave the central claim only weakly supported.

Authors: We acknowledge these omissions weaken the support for the claims. In the revision we will: (1) report appropriate statistical tests (e.g., Wilcoxon signed-rank test with effect sizes) for all coverage and mutant-killing differences; (2) provide precise definitions, prompting templates, and implementation details for each baseline LLM approach; and (3) document the selection criteria for the 10 classes (presence of developer-written mocks, project diversity, and testability) and the four LLMs (model family, size, and access type). These additions will be placed in the Evaluation and Results sections. Revision: yes.
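
The Wilcoxon signed-rank test named in the response works on paired measurements: it ranks the absolute differences and sums the ranks separately for positive and negative differences. A minimal sketch under stated assumptions: it computes only the two rank sums (conventionally written W+ and W-), drops zero differences, assigns average ranks to ties, and does not compute a p-value or an effect size. The class and method names are invented, and the numbers in the note below are made up for illustration, not results from the paper.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; not a full statistical test.
class WilcoxonSketch {
    // Returns {W+, W-}: rank sums of the positive and negative paired differences.
    static double[] signedRankSums(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("paired samples must have equal length");
        }
        List<Double> diffs = new ArrayList<>();
        for (int idx = 0; idx < a.length; idx++) {
            double d = a[idx] - b[idx];
            if (d != 0.0) {
                diffs.add(d); // zero differences carry no sign and are dropped
            }
        }
        // Order by absolute value so that position determines rank.
        diffs.sort((x, y) -> Double.compare(Math.abs(x), Math.abs(y)));

        double wPlus = 0.0;
        double wMinus = 0.0;
        int start = 0;
        while (start < diffs.size()) {
            int end = start;
            // Extend the group over all differences tied in absolute value.
            while (end + 1 < diffs.size()
                    && Math.abs(diffs.get(end + 1)) == Math.abs(diffs.get(start))) {
                end++;
            }
            double avgRank = ((start + 1) + (end + 1)) / 2.0; // mean of ranks in the tie group
            for (int k = start; k <= end; k++) {
                if (diffs.get(k) > 0) {
                    wPlus += avgRank;
                } else {
                    wMinus += avgRank;
                }
            }
            start = end + 1;
        }
        return new double[] {wPlus, wMinus};
    }
}
```

For the made-up pairs a = {5, 3, 8, 6} and b = {4, 5, 5, 6}, the non-zero differences are 1, -2, and 3, so the ranks are 1, 2, and 3, giving W+ = 4 and W- = 2.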

Circularity Check

0 steps flagged

No circularity: empirical evaluation with external baselines

The paper describes an empirical technique (MOCKMILL) that extracts mocking information from existing tests to guide LLM test generation, combined with iterative repair. Evaluation measures line coverage and mutant killing on 10 Java classes against project tests and unspecified baseline approaches. No equations, parameter fits, or derivations are present. Claims rest on direct experimental measurements rather than any self-referential reduction. Self-citations, if any, are not load-bearing for the core results, which are falsifiable against external oracles.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions about LLM code generation capabilities and the utility of mocking patterns in testing; no new free parameters, invented entities, or ad-hoc axioms are introduced beyond domain conventions in software testing.

axioms (2)

Domain assumption: LLMs can produce executable tests when supplied with context about component interactions and stubbings.
Invoked in the description of the iterative generation-and-repair process.
Domain assumption: Mocking information from developer tests encodes useful expectations for new test cases.
Central to targeting components replaced by test doubles.

