Improving LLM-Driven Test Generation by Learning from Mocking Information
Pith reviewed 2026-05-10 03:01 UTC · model grok-4.3
The pith
Mocking information extracted from existing tests lets LLMs generate unit tests that cover lines of code and kill mutants that existing tests and baseline-generated tests miss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MOCKMILL extracts mocking information from existing tests, namely the stubbings and interaction expectations defined for components that are replaced by test doubles, and uses it to guide LLM-based test generation. An iterative generation-and-repair process then makes the generated tests executable. On 10 open-source classes from six Java projects, using four LLMs, MOCKMILL's tests cover lines of code and kill mutants that existing tests and baseline-generated tests miss.
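The claim is about complementarity: what the new tests reach that the other suites do not. A minimal sketch of that comparison is a set difference over covered line numbers (or killed mutant ids). The class and method names below are hypothetical and are not taken from the paper or its replication package.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: which lines does the mock-guided suite reach that the others miss? */
final class ComplementSketch {

    /**
     * Returns the line numbers covered by the mock-guided suite
     * but by neither the existing tests nor the baseline-generated tests.
     */
    static Set<Integer> uniquelyReached(Set<Integer> mockGuided,
                                        Set<Integer> existing,
                                        Set<Integer> baseline) {
        Set<Integer> unique = new TreeSet<>(mockGuided); // copy, so inputs stay untouched
        unique.removeAll(existing);
        unique.removeAll(baseline);
        return unique;
    }
}
```

The same helper applies to mutants if each mutant is given an integer id; a non-empty result is what "covers lines that the others miss" means in practice.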
What carries the argument
MOCKMILL, the technique that automatically extracts mocking information (stubbings and interaction expectations) from developer-written tests to guide LLM test generation.
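To make "stubbings and interaction expectations" concrete: in Mockito-style tests a stubbing typically looks like `when(repo.findById(42)).thenReturn(order)` and an interaction expectation like `verify(notifier, times(1)).send(order)`. The sketch below pulls both kinds of facts out of test source text with regular expressions. It is an illustration under assumptions, not MOCKMILL's extraction mechanism, which this page does not describe; the class name and output format are hypothetical, and forms such as `doReturn(...).when(mock)` are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: collects Mockito-style stubbings and verifications from test source. */
final class MockingInfoSketch {

    // when(mock.method(...)): a stubbing that fixes what the test double returns.
    private static final Pattern STUBBING =
            Pattern.compile("when\\(\\s*(\\w+)\\.(\\w+)\\(");

    // verify(mock).method(...) or verify(mock, times(n)).method(...): an interaction expectation.
    private static final Pattern EXPECTATION =
            Pattern.compile("verify\\(\\s*(\\w+)\\s*(?:,[^;]*?)?\\)\\s*\\.(\\w+)\\(");

    /** Returns facts such as "stubbing repo.findById", stubbings first, then expectations. */
    static List<String> extract(String testSource) {
        List<String> facts = new ArrayList<>();
        Matcher stub = STUBBING.matcher(testSource);
        while (stub.find()) {
            facts.add("stubbing " + stub.group(1) + "." + stub.group(2));
        }
        Matcher expect = EXPECTATION.matcher(testSource);
        while (expect.find()) {
            facts.add("expectation " + expect.group(1) + "." + expect.group(2));
        }
        return facts;
    }
}
```

A real extractor would more likely work on a parsed syntax tree than on raw text, since regular expressions miss multi-line calls and aliased mocks; the text-based version is used here only to keep the example self-contained.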
If this is right
- LLM test generation can be enhanced by incorporating domain-specific mocking patterns from existing suites.
- Tests produced this way complement both manual tests and standard prompting approaches.
- Iterative repair helps ensure the generated tests are executable.
- This approach is applicable to Java projects with existing test suites containing mocks.
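The iterative repair mentioned in the list above can be sketched as a bounded loop that feeds each failure back into the next generation attempt. Everything here is an assumption made for illustration: the LLM call and the compile-and-run step are passed in as plain functions, and the round budget is a hypothetical parameter. The summarized paper may structure its pipeline differently.

```java
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical sketch of a generate-and-repair loop; not MOCKMILL's actual pipeline. */
final class RepairLoopSketch {

    /**
     * @param prompt    initial prompt (for example, class under test plus mocking information)
     * @param generate  stand-in for an LLM call: (prompt, last error) -> candidate test source
     * @param check     stand-in for compile-and-run: empty if executable, else an error message
     * @param maxRounds assumed upper bound on attempts
     * @return the first executable candidate, or empty if the budget runs out
     */
    static Optional<String> generateExecutableTest(
            String prompt,
            BiFunction<String, String, String> generate,
            Function<String, Optional<String>> check,
            int maxRounds) {
        String feedback = ""; // no error message on the first attempt
        for (int round = 0; round < maxRounds; round++) {
            String candidate = generate.apply(prompt, feedback);
            Optional<String> error = check.apply(candidate);
            if (!error.isPresent()) {
                return Optional.of(candidate); // compiles and runs: keep it
            }
            feedback = error.get(); // repair step: next attempt sees the failure
        }
        return Optional.empty(); // budget exhausted: discard the test
    }
}
```

Bounding the loop matters because a candidate that never compiles would otherwise consume LLM calls indefinitely.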
Where Pith is reading between the lines
- If mocking information generalizes, similar techniques could apply to other test artifacts such as assertions or setup code.
- This might reduce the need for extensive prompt engineering in test generation tasks.
- On projects without many mocks, the benefit might be limited, suggesting hybrid approaches.
Load-bearing premise
The assumption that mocking information from existing tests provides useful and generalizable guidance for generating new tests beyond what LLMs can infer from code and standard prompts alone.
What would settle it
Running MOCKMILL on additional projects or classes where existing tests have minimal or atypical mocking patterns and observing no improvement in coverage or mutant killing over baselines.
Original abstract
Large Language Models (LLMs) have recently shown strong potential for automated unit test generation. This has motivated us to investigate whether developer-defined test doubles (commonly referred to as mocks) available in existing test suites can be leveraged to improve LLM-driven test generation. To this end, we propose MOCKMILL, an LLM-based technique and tool that generates test cases by exploiting mocking information automatically extracted from developer-written tests. MOCKMILL targets components that are replaced by test doubles in existing tests and uses the encoded stubbings and interaction expectations to guide test generation, combined with an iterative generation-and-repair process to ensure executable tests. We evaluated MOCKMILL on 10 open-source classes from six Java projects using four LLMs, and compared the generated tests with existing project tests and tests produced by baseline approaches. The results show that MOCKMILL's tests cover lines of code and kill mutants that existing tests and baseline-generated tests miss. Overall, our findings provide preliminary evidence that leveraging mocking information is a complementary and effective way to enhance LLM-based test generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MOCKMILL, an LLM-based technique that extracts mocking information (stubbings and interaction expectations) from existing developer-written tests to guide the generation of unit tests targeting components replaced by test doubles. The approach combines this guidance with an iterative generation-and-repair process to produce executable tests. It is evaluated on 10 open-source Java classes from six projects using four LLMs, with the central claim that the resulting tests cover lines of code and kill mutants that both the project's existing tests and tests from baseline LLM approaches miss.
Significance. If the empirical results hold after addressing the evaluation gaps, this would represent a useful contribution to automated test generation by demonstrating how readily available mocking artifacts in existing test suites can provide complementary guidance to LLMs. The multi-LLM, multi-project evaluation setup offers a reasonable empirical foundation for assessing practicality in real-world Java codebases.
major comments (2)
- [Evaluation section] The headline results attribute superior line coverage and mutant killing to MOCKMILL's use of extracted mocking information. However, the technique also includes an iterative generation-and-repair process, and the comparison is only against unspecified baseline approaches and existing tests. No ablation is described that applies identical iteration and repair but omits the mocking extraction and stubbing guidance. Without this control, the observed gains cannot be confidently ascribed to the mocking component.
- [Results section] Results presentation: The abstract and evaluation provide no details on statistical tests for the reported differences in coverage and mutant killing, the exact definition and implementation of the baseline approaches, or the criteria used to select the 10 classes and 4 LLMs. These omissions leave the central claim only weakly supported.
minor comments (2)
- [Abstract] The abstract refers to 'baseline approaches' without naming or briefly describing them; this should be clarified early in the introduction or evaluation to allow readers to understand the comparisons.
- [Approach description] Consider adding a brief example in the approach description showing how mocking information is automatically extracted from a sample test to improve accessibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the evaluation and results presentation.
- Referee: [Evaluation section] The headline results attribute superior line coverage and mutant killing to MOCKMILL's use of extracted mocking information. However, the technique also includes an iterative generation-and-repair process, and the comparison is only against unspecified baseline approaches and existing tests. No ablation is described that applies identical iteration and repair but omits the mocking extraction and stubbing guidance. Without this control, the observed gains cannot be confidently ascribed to the mocking component.
  Authors: We agree that the current evaluation does not isolate the contribution of the mocking information from the iterative generation-and-repair process. In the revised manuscript we will add an ablation study that applies the identical iterative generation-and-repair pipeline but disables the mocking extraction and stubbing guidance. This control will allow us to attribute performance differences more precisely to the mocking component. We will also expand the description of the baseline approaches to make the comparisons explicit.
  Revision: yes
- Referee: [Results section] The abstract and evaluation provide no details on statistical tests for the reported differences in coverage and mutant killing, the exact definition and implementation of the baseline approaches, or the criteria used to select the 10 classes and 4 LLMs. These omissions leave the central claim only weakly supported.
  Authors: We acknowledge these omissions weaken the support for the claims. In the revision we will: (1) report appropriate statistical tests (e.g., Wilcoxon signed-rank test with effect sizes) for all coverage and mutant-killing differences; (2) provide precise definitions, prompting templates, and implementation details for each baseline LLM approach; and (3) document the selection criteria for the 10 classes (presence of developer-written mocks, project diversity, and testability) and the four LLMs (model family, size, and access type). These additions will be placed in the Evaluation and Results sections.
  Revision: yes
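For readers unfamiliar with the Wilcoxon signed-rank test named in the response above, the sketch below computes its rank sums for paired measurements (for example, per-class coverage under two approaches) together with one common effect size, the matched-pairs rank-biserial correlation. It is a minimal illustration under assumptions: zero differences are dropped, tied absolute differences share an average rank, no p-value is computed, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: Wilcoxon signed-rank sums and a rank-biserial effect size. */
final class WilcoxonSketch {

    /** Returns {W+, W-}: rank sums of the positive and negative paired differences a[i] - b[i]. */
    static double[] rankSums(double[] a, double[] b) {
        List<Double> diffs = new ArrayList<>();
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            if (d != 0.0) diffs.add(d); // zero differences carry no sign information
        }
        // Order by absolute value so that position corresponds to rank.
        diffs.sort((x, y) -> Double.compare(Math.abs(x), Math.abs(y)));

        double wPlus = 0.0, wMinus = 0.0;
        int i = 0;
        while (i < diffs.size()) {
            int j = i;
            while (j < diffs.size()
                    && Math.abs(diffs.get(j)) == Math.abs(diffs.get(i))) {
                j++; // extend the group of tied absolute differences
            }
            double avgRank = (i + 1 + j) / 2.0; // mean of the 1-based ranks i+1 .. j
            for (int k = i; k < j; k++) {
                if (diffs.get(k) > 0) wPlus += avgRank; else wMinus += avgRank;
            }
            i = j;
        }
        return new double[] {wPlus, wMinus};
    }

    /** Matched-pairs rank-biserial correlation in [-1, 1]; 0 if there are no non-zero differences. */
    static double rankBiserial(double[] sums) {
        double total = sums[0] + sums[1];
        return total == 0.0 ? 0.0 : (sums[0] - sums[1]) / total;
    }
}
```

With only 10 classes per comparison, an exact or table-based p-value would be more appropriate than a normal approximation, which is why the sketch stops at the rank sums.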
Circularity Check
No circularity: empirical evaluation with external baselines
The paper describes an empirical technique (MOCKMILL) that extracts mocking information from existing tests to guide LLM test generation, combined with iterative repair. Evaluation measures line coverage and mutant killing on 10 Java classes against project tests and unspecified baseline approaches. No equations, parameter fits, or derivations are present. Claims rest on direct experimental measurements rather than any self-referential reduction. Self-citations, if any, are not load-bearing for the core results, which are falsifiable against external oracles.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can produce executable tests when supplied with context about component interactions and stubbings
- domain assumption: Mocking information from developer tests encodes useful expectations for new test cases