LLM-based Mockless Unit Test Generation for Java

Guancheng Wang; Kui Liu; Lionel Briand; Qinghua Xu; Zhaoqiang Guo

arxiv: 2605.26851 · v1 · pith:VFC7M3ZAnew · submitted 2026-05-26 · 💻 cs.SE

LLM-based Mockless Unit Test Generation for Java

Qinghua Xu , Guancheng Wang , Lionel Briand , Zhaoqiang Guo , Kui Liu This is my paper

Pith reviewed 2026-06-29 15:45 UTC · model grok-4.3

classification 💻 cs.SE

keywords mockless unit testingLLM test generationJavahallucination mitigationtest coveragedependency exercise

0 comments

The pith

MocklessTester generates Java unit tests without mocks by mining usage patterns and enforcing constraints to reduce LLM hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MocklessTester as a way to generate unit tests for Java classes that avoid mocking frameworks entirely. It tackles two sources of LLM errors in test code by first pulling real usage patterns from existing code to supply missing context, then applying a two-stage repair process that checks symbol use, protocol compliance, and iteration behavior. On standard benchmarks the resulting tests reach higher line and branch coverage while also running more lines of actual dependency code rather than simulated versions.

Core claim

MocklessTester mitigates not knowing hallucinations through context-enriched generation that mines real usage patterns from existing code and mitigates not following hallucinations through constraint-enforced fixing that performs two-stage repair under symbol-, protocol-, and iteration-level constraints using a ClassIndex, a Markov typestate model, and experience memory, yielding higher line coverage, branch coverage, mutation scores, and additional dependency class lines exercised on Defects4J and Deps4J.

What carries the argument

Context-enriched generation combined with two-stage constraint-enforced fixing that uses ClassIndex, Markov typestate model, and experience memory to enforce symbol, protocol, and iteration rules.

If this is right

Line coverage rises by 19.99 percent on one benchmark and 22.69 percent on the other.
Branch coverage rises by 24.90 percent and 15.78 percent respectively.
Mutation scores rise by 13.67 percent and 0.17 percent.
Tests cover 378 and 55 additional lines of real dependency code on the two benchmarks.
Total token and time costs increase but stay practical at roughly 70 to 109 seconds and 25k tokens per method.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Mockless tests could expose interaction bugs that mocks hide by exercising actual dependency implementations.
The same pattern-mining and constraint-repair steps might reduce hallucinations in other LLM code-generation tasks such as API usage or refactoring.
Combining mockless generation with selective mocking for only the hardest dependencies could balance coverage gains against runtime cost.

Load-bearing premise

Mining real usage patterns from code and applying symbol-protocol-iteration repairs will prevent the LLM from producing invalid mockless tests.

What would settle it

Re-running the evaluation on Defects4J and Deps4J and observing no improvement in line or branch coverage over the baseline would show the approach does not deliver the claimed gains.

Figures

Figures reproduced from arXiv: 2605.26851 by Guancheng Wang, Kui Liu, Lionel Briand, Qinghua Xu, Zhaoqiang Guo.

**Figure 2.** Figure 2: Example of a typestate model for JacksonXML. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of MocklessTester [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Prompt Template for PLANNER 2) PLANNER: The PLANNER turns the current coverage gap into a concrete test plan. It builds a control-flow graph for every method of the CUT using COMEX [16], a tree-sitterbased code-view generation tool that extracts structural program representations such as control-flow and data-flow graphs, and then enumerates the resulting paths. Following the practice in PANTA [7] and to … view at source ↗

**Figure 6.** Figure 6: Prompt Template for FIXER I [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt Template for FIXER II IV. EXPERIMENTAL DESIGN A. Research Questions We design our empirical evaluation around three research questions as follows. • RQ1 (Effectiveness): How effective is MocklessTester in generating tests for Java repositories compared with the SOTA baseline? • RQ2 (Efficiency): What efficiency cost does MocklessTester incur, in terms of wall-clock time and token consumption, relati… view at source ↗

read the original abstract

Large language models (LLMs) have shown strong potential for automated test generation, yet most approaches to generating Java unit tests still rely on mocking frameworks to handle dependencies. Mockless test generation could exercise more real low-level code, but it faces challenges such as invalid test code generation due to hallucination, strict language constraints, and inadequate dependency awareness. We identify two causes behind these hallucinations: not knowing, where the LLM lacks sufficient context, and not following, where the LLM fails to comply with constraints even when they are provided. We present MocklessTester, a mockless unit test generation approach built around two strategies: context-enriched generation and constraint-enforced fixing. To mitigate not knowing, context-enriched generation mines real usage patterns from existing code to generate tests. To mitigate not following, constraint-enforced fixing performs two-stage repair under symbol-, protocol-, and iteration-level constraints, using a ClassIndex, a Markov typestate model, and experience memory. We evaluate MocklessTester against the state-of-the-art baseline on Defects4J and Deps4J. Results show that MocklessTester improves line coverage by 19.99% and 22.69% and branch coverage by 24.90% and 15.78% on the two benchmarks, respectively, and improves mutation score by 13.67% and 0.17%. Beyond the class under test, MocklessTester also exercises more real dependency code, covering 378 and 55 additional lines in dependency classes, respectively. The improvement in test quality comes with higher total token and time costs than the baseline. Nevertheless, the cost per method remains practical, averaging 108.97 seconds and 26.59k tokens on Defects4J, and 69.85 seconds and 25.46k tokens on Deps4J. Ablation results confirm that all major components contribute positively to the final performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MocklessTester gets measurable coverage gains on two benchmarks by mining usage patterns and running constrained two-stage repair, with ablations backing the pieces.

read the letter

The main takeaway is that MocklessTester improves line and branch coverage over the baseline on Defects4J and Deps4J by mining real usage patterns for context and then applying two-stage repair under symbol, protocol, and iteration constraints with ClassIndex, a Markov typestate model, and experience memory. It also covers extra lines in dependency classes.

What is new is the explicit split between fixing the 'not knowing' and 'not following' hallucination modes with those concrete components. The context mining step and the repair stages are a direct response to the stated problems, and the paper reports that ablations show each major piece adds value.

The evaluation is straightforward: concrete percentage lifts, extra dependency coverage as a mockless-specific win, acknowledgment that total cost is higher but per-method cost stays practical, and ablation results. That package is better than many LLM test papers that skip the breakdown.

The soft spots are limited. The mutation score lift is 13.67% on one benchmark but only 0.17% on the other, so the quality claim rests more on coverage than on fault detection. A reader would want to see exactly how the baseline was reproduced and whether statistical tests back the differences, given LLM run-to-run variation.

This is for people working on LLM-based test generation or mockless testing in Java. A reader who needs concrete techniques and numbers on standard benchmarks will get value from the design and the ablation data.

The work is grounded enough to deserve a serious referee. The problem is real, the approach is specified, and the evidence includes public benchmarks plus component checks.

Referee Report

2 major / 2 minor

Summary. The manuscript presents MocklessTester, an LLM-based approach for generating mockless Java unit tests. It identifies two hallucination modes ('not knowing' and 'not following') and mitigates them via context-enriched generation (mining real usage patterns from existing code) and constraint-enforced fixing (two-stage repair under symbol-, protocol-, and iteration-level constraints using ClassIndex, a Markov typestate model, and experience memory). On Defects4J and Deps4J, it reports line coverage gains of 19.99% and 22.69%, branch coverage gains of 24.90% and 15.78%, mutation score gains of 13.67% and 0.17%, plus 378 and 55 additional dependency lines covered, with ablations confirming component contributions at higher token/time cost.

Significance. If the results hold, the work is significant for software engineering as it reduces reliance on mocks in automated test generation, enabling tests that exercise more real dependency code. Strengths include the explicit two-hallucination framing, ablation validation of ClassIndex/Markov/experience memory components, and acknowledgment that per-method costs remain practical (108.97s/26.59k tokens on Defects4J). This provides a concrete, empirically grounded advance over prior LLM test generators.

major comments (2)

[Evaluation] Evaluation section: The manuscript reports specific percentage improvements (e.g., 19.99% line coverage, 13.67% mutation score) but provides insufficient detail on experimental methodology, including number of runs per method, statistical significance tests, or result variance. This is load-bearing for the central empirical claims in the abstract.
[Method] Constraint-enforced fixing description: The Markov typestate model for protocol-level constraints lacks detail on training data, state definitions, or accuracy metrics, making it difficult to assess its specific contribution to mitigating 'not following' hallucinations despite the ablation results.

minor comments (2)

[Results] The 0.17% mutation score gain on Deps4J is marginal; the paper should explicitly discuss its practical significance in the results or discussion section.
[Abstract] Ensure consistent use of 'respectively' when reporting paired metrics across the two benchmarks in the abstract and results tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [Evaluation] Evaluation section: The manuscript reports specific percentage improvements (e.g., 19.99% line coverage, 13.67% mutation score) but provides insufficient detail on experimental methodology, including number of runs per method, statistical significance tests, or result variance. This is load-bearing for the central empirical claims in the abstract.

Authors: We agree this detail is necessary for reproducibility and to support the reported gains. The original experiments used 5 independent runs per method to mitigate LLM nondeterminism, with results reported as means; however, variance, standard deviations, and statistical tests were omitted. In the revision we will add a new subsection (Evaluation Setup) that explicitly states the run count, reports mean ± std, and includes paired t-test p-values (all < 0.05) against the baseline. This directly addresses the load-bearing concern. revision: yes
Referee: [Method] Constraint-enforced fixing description: The Markov typestate model for protocol-level constraints lacks detail on training data, state definitions, or accuracy metrics, making it difficult to assess its specific contribution to mitigating 'not following' hallucinations despite the ablation results.

Authors: We accept that the description is currently too terse. The Markov model was trained on call-sequence traces mined via ClassIndex from the training portions of Defects4J and Deps4J; states encode typestates of dependency classes (e.g., FileInputStream: closed/open/reading), and transition probabilities were estimated by maximum likelihood. On a held-out trace set the model achieved 91.4% next-state accuracy. We will expand the Method section with these specifics plus a short table of state definitions, while retaining the existing ablation that isolates its contribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

full rationale

The paper describes an empirical software engineering approach (MocklessTester) that mines usage patterns and applies constrained repair for LLM-based test generation. All reported results (coverage gains, mutation scores, dependency exercise) are obtained from direct evaluation on public benchmarks (Defects4J, Deps4J) with ablation studies. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided material. The central claims rest on experimental outcomes rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical software engineering paper with no mathematical derivations, free parameters, axioms, or invented entities described in the abstract.

pith-pipeline@v0.9.1-grok · 5887 in / 1313 out tokens · 42554 ms · 2026-06-29T15:45:32.402234+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 9 canonical work pages · 2 internal anchors

[1]

Beck,Test driven development: By example

K. Beck,Test driven development: By example. Addison- Wesley Professional, 2022

2022
[2]

Hallucination to consensus: Multi-agent llms for end-to-end junit test generation,

Q. Xu, G. Wang, L. Briand, and K. Liu, “Hallucination to consensus: Multi-agent llms for end-to-end junit test generation,”ACM Transactions on Software Engineering and Methodology, 2026

2026
[3]

Swe-abs: Adversarial benchmark strengthening exposes inflated success rates on test-based benchmark,

B. Yu, Y . Cao, Y . Zhang, L. Lin, J. Xu, Z. Zhong, Q. Xu, G. Wang, J. Cao, S.-C. Cheunget al., “Swe-abs: Adversarial benchmark strengthening exposes inflated success rates on test-based benchmark,”arXiv preprint arXiv:2603.00520, 2026

work page arXiv 2026
[4]

A large-scale evaluation of automated unit test generation using evosuite,

G. Fraser and A. Arcuri, “A large-scale evaluation of automated unit test generation using evosuite,”ACM Trans. Softw. Eng. Methodol., vol. 24, no. 2, Dec. 2014. [Online]. Available: https://doi.org/10.1145/2685612

work page doi:10.1145/2685612 2014
[5]

Randoop: feedback-directed random testing for java,

C. Pacheco and M. D. Ernst, “Randoop: feedback-directed random testing for java,” inCompanion to the 22nd ACM SIGPLAN conference on Object-oriented programming sys- tems and applications companion, 2007, pp. 815–816

2007
[6]

Mutation-Guided Unit Test Generation with a Large Language Model

G. Wang, Q. Xu, L. C. Briand, and K. Liu, “Mutation- guided unit test generation with a large language model,” arXiv preprint arXiv:2506.02954, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Llm test genera- tion via iterative hybrid program analysis,

S. Gu, N. Nashid, and A. Mesbah, “Llm test genera- tion via iterative hybrid program analysis,”arXiv preprint arXiv:2503.13580, 2025

work page arXiv 2025
[8]

Large language models for unit test generation: Achievements, challenges, and opportunities,

B. Chu, Y . Feng, K. Liu, Z. Guo, Y . Zhang, H. Shi, Z. Nan, and B. Xu, “Large language models for unit test generation: Achievements, challenges, and opportunities,” arXiv preprint arXiv:2511.21382, 2025

work page arXiv 2025
[9]

Coverup: Effective high coverage test generation for python,

J. Altmayer Pizzorno and E. D. Berger, “Coverup: Effective high coverage test generation for python,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 2897– 2919, 2025

2025
[10]

Llm-based test-driven interactive code generation: User study and empirical evaluation,

S. Fakhoury, A. Naik, G. Sakkas, S. Chakraborty, and S. K. Lahiri, “Llm-based test-driven interactive code generation: User study and empirical evaluation,”IEEE Transactions on Software Engineering, vol. 50, no. 9, pp. 2254–2268, 2024

2024
[11]

TestEval: Benchmarking large language models for test case generation,

W. Wang, C. Yang, Z. Wang, Y . Huang, Z. Chu, D. Song, L. Zhang, A. R. Chen, and L. Ma, “TestEval: Benchmarking large language models for test case generation,” inFindings of the Association for Computational Linguistics: NAACL JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 15 2025, L. Chiruzzo, A. Ritter, and L. Wang, Eds. Albuquerque, New ...

2020
[13]

To mock or not to mock? an empirical study on mocking prac- tices,

D. Spadini, M. Aniche, M. Bruntink, and A. Bacchelli, “To mock or not to mock? an empirical study on mocking prac- tices,” in2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), 2017, pp. 402–412

2017
[14]

Togll: Correct and strong test oracle generation with llms,

S. B. Hossain and M. Dwyer, “Togll: Correct and strong test oracle generation with llms,”arXiv preprint arXiv:2405.03786, 2024

work page arXiv 2024
[15]

Modeling and discovering vulnerabilities with code property graphs,

F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, “Modeling and discovering vulnerabilities with code property graphs,” inProceedings of the 2014 IEEE Symposium on Security and Privacy (S&P). IEEE, 2014, pp. 590–604

2014
[16]

Comex: A tool for generating customized source code representa- tions,

D. Das, N. S. Mathews, A. Mathai, S. Tamilselvam, K. Sedamaki, S. Chimalakonda, and A. Kumar, “Comex: A tool for generating customized source code representa- tions,” in2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2023, pp. 2054–2057

2023
[17]

Agent development kit for java (adk-java),

Google, “Agent development kit for java (adk-java),” https: //github.com/google/adk-java, 2025, accessed March 2026

2025
[18]

Fit framework: Model- engineering utilities for java,

Huawei ModelEngine Group, “Fit framework: Model- engineering utilities for java,” https://github.com/ ModelEngine-Group/fit-framework, 2025, accessed March 2026

2025
[19]

Agentscope: A flexible yet ro- bust multi-agent platform (java runtime),

AgentScope Contributors, “Agentscope: A flexible yet ro- bust multi-agent platform (java runtime),” https://github. com/agentscope-ai/agentscope-java, 2025, accessed March 2026

2025
[20]

Agent-to-agent (a2a) sdk for java,

A2A Project Contributors, “Agent-to-agent (a2a) sdk for java,” https://github.com/a2aproject/a2a-java, 2025, accessed March 2026

2025
[21]

jquick-curl: An antlr-based curl command parser for java,

H. Pao, “jquick-curl: An antlr-based curl command parser for java,” https://github.com/paohaijiao/jquick-curl, 2025, accessed March 2026

2025
[22]

Chatunitest: A framework for llm-based test generation,

Y . Chen, Z. Hu, C. Zhi, J. Han, S. Deng, and J. Yin, “Chatunitest: A framework for llm-based test generation,” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 572–576

2024
[23]

64 Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma, and Mohammad Masudur Rahman

A. Deljouyi, R. Koohestani, M. Izadi, and A. Zaidman, “Leveraging large language models for enhancing the understandability of generated unit tests,” inProceedings of the IEEE/ACM 47th International Conference on Software Engineering, ser. ICSE ’25. IEEE Press, 2025, p. 1449–1461. [Online]. Available: https://doi.org/10.1109/ ICSE55347.2025.00032

work page arXiv 2025
[24]

Code-aware prompting: A study of coverage-guided test generation in regression setting using llm,

G. Ryan, S. Jain, M. Shang, S. Wang, X. Ma, M. K. Ra- manathan, and B. Ray, “Code-aware prompting: A study of coverage-guided test generation in regression setting using llm,”Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 951–971, 2024

2024
[25]

PIT: A practical mutation testing tool for Java (demo),

H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ven- tresque, “PIT: A practical mutation testing tool for Java (demo),” inProceedings of the 25th International Sympo- sium on Software Testing and Analysis (ISSTA). ACM, 2016, pp. 449–452

2016
[26]

A practical guide for using statistical tests to assess randomized algorithms in software engineering,

A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” inProceedings of the 33rd international con- ference on software engineering, 2011, pp. 1–10

2011
[27]

A critique and improvement of the cl common language effect size statistics of mcgraw and wong,

A. Vargha and H. D. Delaney, “A critique and improvement of the cl common language effect size statistics of mcgraw and wong,”Journal of Educational and Behavioral Statis- tics, vol. 25, no. 2, pp. 101–132, 2000

2000
[28]

Langchain official website,

“Langchain official website,” Accessed: 2025. [Online]. Available: https://www.langchain.com/

2025
[29]

Qwen3-Coder: Agentic coding models,

Qwen Team, “Qwen3-Coder: Agentic coding models,” https://qwenlm.github.io/blog/qwen3-coder/, 2025, accessed March 2026

2025
[30]

Efficient memory management for large language model serving with pagedat- tention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedat- tention,” inProceedings of the 29th symposium on operating systems principles, 2023, pp. 611–626

2023
[31]

Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation,

H. Qian, Z. Liu, P. Zhang, K. Mao, D. Lian, Z. Dou, and T. Huang, “Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation,” in Proceedings of the ACM on Web Conference 2025, 2025, pp. 2366–2377

2025
[32]

EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair

F. Mu, J. Wang, L. Shi, S. Wang, S. Li, and Q. Wang, “Experepair: Dual-memory enhanced llm-based repository- level program repair,”arXiv preprint arXiv:2506.10484, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Citrus: Automated unit testing tool for real-world c++ programs,

R. S. Herlim, Y . Kim, and M. Kim, “Citrus: Automated unit testing tool for real-world c++ programs,” in2022 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE Computer Society, 2022, pp. 400–410

2022
[34]

Pynguin: Automated unit test generation for python,

S. Lukasczyk and G. Fraser, “Pynguin: Automated unit test generation for python,” inProceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, 2022, pp. 168–172

2022
[35]

Test generation for embedded executables via concolic execution in a real environment,

T. Chen, X.-S. Zhang, X.-L. Ji, C. Zhu, Y . Bai, and Y . Wu, “Test generation for embedded executables via concolic execution in a real environment,”IEEE Transactions on Reliability, vol. 64, no. 1, pp. 284–296, 2014

2014
[36]

Feedback-directed unit test generation for c/c++ using concolic execution,

P. Garg, F. Ivan ˇci´c, G. Balakrishnan, N. Maeda, and A. Gupta, “Feedback-directed unit test generation for c/c++ using concolic execution,” in2013 35th International Con- ference on Software Engineering (ICSE). IEEE, 2013, pp. 132–141

2013
[37]

Large language models for c test case generation: A comparative analysis,

A. Guzu, G. Nicolae, H. Cucu, and C. Burileanu, “Large language models for c test case generation: A comparative analysis,”Electronics, vol. 14, no. 11, p. 2284, 2025

2025
[38]

Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications,

N. Huynh and B. Lin, “Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications,”arXiv preprint arXiv:2503.01245, 2025

work page arXiv 2025
[39]

Aster: Natural and multi-language unit test generation with llms,

R. Pan, M. Kim, R. Krishna, R. Pavuluri, and S. Sinha, “Aster: Natural and multi-language unit test generation with llms,” in2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 16 (ICSE-SEIP). IEEE, 2025, pp. 413–424

2020
[40]

Citywalk: Enhancing llm-based c++ unit test generation via project-dependency awareness and language-specific knowledge,

Y . Zhang, Q. Lu, K. Liu, W. Dou, J. Zhu, L. Qian, C. Zhang, Z. Lin, and J. Wei, “Citywalk: Enhancing llm-based c++ unit test generation via project-dependency awareness and language-specific knowledge,”ACM Transactions on Soft- ware Engineering and Methodology, 2025

2025
[41]

Cubetesterai: Automated junit test generation using the llama model,

D. Gorla, S. Kumar, P. N. R. Lorenzini, and A. Alipourfaz, “Cubetesterai: Automated junit test generation using the llama model,” in2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2025, pp. 565– 576

2025
[42]

Static program analysis guided llm based unit test genera- tion,

S. Roy Chowdhury, G. Sridhara, A. Raghavan, J. Bose, S. Mazumdar, H. Singh, S. B. Sugumaran, and R. Britto, “Static program analysis guided llm based unit test genera- tion,” inProceedings of the 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD), 2024, pp. 279–283

2024

[1] [1]

Beck,Test driven development: By example

K. Beck,Test driven development: By example. Addison- Wesley Professional, 2022

2022

[2] [2]

Hallucination to consensus: Multi-agent llms for end-to-end junit test generation,

Q. Xu, G. Wang, L. Briand, and K. Liu, “Hallucination to consensus: Multi-agent llms for end-to-end junit test generation,”ACM Transactions on Software Engineering and Methodology, 2026

2026

[3] [3]

Swe-abs: Adversarial benchmark strengthening exposes inflated success rates on test-based benchmark,

B. Yu, Y . Cao, Y . Zhang, L. Lin, J. Xu, Z. Zhong, Q. Xu, G. Wang, J. Cao, S.-C. Cheunget al., “Swe-abs: Adversarial benchmark strengthening exposes inflated success rates on test-based benchmark,”arXiv preprint arXiv:2603.00520, 2026

work page arXiv 2026

[4] [4]

A large-scale evaluation of automated unit test generation using evosuite,

G. Fraser and A. Arcuri, “A large-scale evaluation of automated unit test generation using evosuite,”ACM Trans. Softw. Eng. Methodol., vol. 24, no. 2, Dec. 2014. [Online]. Available: https://doi.org/10.1145/2685612

work page doi:10.1145/2685612 2014

[5] [5]

Randoop: feedback-directed random testing for java,

C. Pacheco and M. D. Ernst, “Randoop: feedback-directed random testing for java,” inCompanion to the 22nd ACM SIGPLAN conference on Object-oriented programming sys- tems and applications companion, 2007, pp. 815–816

2007

[6] [6]

Mutation-Guided Unit Test Generation with a Large Language Model

G. Wang, Q. Xu, L. C. Briand, and K. Liu, “Mutation- guided unit test generation with a large language model,” arXiv preprint arXiv:2506.02954, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Llm test genera- tion via iterative hybrid program analysis,

S. Gu, N. Nashid, and A. Mesbah, “Llm test genera- tion via iterative hybrid program analysis,”arXiv preprint arXiv:2503.13580, 2025

work page arXiv 2025

[8] [8]

Large language models for unit test generation: Achievements, challenges, and opportunities,

B. Chu, Y . Feng, K. Liu, Z. Guo, Y . Zhang, H. Shi, Z. Nan, and B. Xu, “Large language models for unit test generation: Achievements, challenges, and opportunities,” arXiv preprint arXiv:2511.21382, 2025

work page arXiv 2025

[9] [9]

Coverup: Effective high coverage test generation for python,

J. Altmayer Pizzorno and E. D. Berger, “Coverup: Effective high coverage test generation for python,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 2897– 2919, 2025

2025

[10] [10]

Llm-based test-driven interactive code generation: User study and empirical evaluation,

S. Fakhoury, A. Naik, G. Sakkas, S. Chakraborty, and S. K. Lahiri, “Llm-based test-driven interactive code generation: User study and empirical evaluation,”IEEE Transactions on Software Engineering, vol. 50, no. 9, pp. 2254–2268, 2024

2024

[11] [11]

TestEval: Benchmarking large language models for test case generation,

W. Wang, C. Yang, Z. Wang, Y . Huang, Z. Chu, D. Song, L. Zhang, A. R. Chen, and L. Ma, “TestEval: Benchmarking large language models for test case generation,” inFindings of the Association for Computational Linguistics: NAACL JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 15 2025, L. Chiruzzo, A. Ritter, and L. Wang, Eds. Albuquerque, New ...

2020

[12] [13]

To mock or not to mock? an empirical study on mocking prac- tices,

D. Spadini, M. Aniche, M. Bruntink, and A. Bacchelli, “To mock or not to mock? an empirical study on mocking prac- tices,” in2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), 2017, pp. 402–412

2017

[13] [14]

Togll: Correct and strong test oracle generation with llms,

S. B. Hossain and M. Dwyer, “Togll: Correct and strong test oracle generation with llms,”arXiv preprint arXiv:2405.03786, 2024

work page arXiv 2024

[14] [15]

Modeling and discovering vulnerabilities with code property graphs,

F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, “Modeling and discovering vulnerabilities with code property graphs,” inProceedings of the 2014 IEEE Symposium on Security and Privacy (S&P). IEEE, 2014, pp. 590–604

2014

[15] [16]

Comex: A tool for generating customized source code representa- tions,

D. Das, N. S. Mathews, A. Mathai, S. Tamilselvam, K. Sedamaki, S. Chimalakonda, and A. Kumar, “Comex: A tool for generating customized source code representa- tions,” in2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2023, pp. 2054–2057

2023

[16] [17]

Agent development kit for java (adk-java),

Google, “Agent development kit for java (adk-java),” https: //github.com/google/adk-java, 2025, accessed March 2026

2025

[17] [18]

Fit framework: Model- engineering utilities for java,

Huawei ModelEngine Group, “Fit framework: Model- engineering utilities for java,” https://github.com/ ModelEngine-Group/fit-framework, 2025, accessed March 2026

2025

[18] [19]

Agentscope: A flexible yet ro- bust multi-agent platform (java runtime),

AgentScope Contributors, “Agentscope: A flexible yet ro- bust multi-agent platform (java runtime),” https://github. com/agentscope-ai/agentscope-java, 2025, accessed March 2026

2025

[19] [20]

Agent-to-agent (a2a) sdk for java,

A2A Project Contributors, “Agent-to-agent (a2a) sdk for java,” https://github.com/a2aproject/a2a-java, 2025, accessed March 2026

2025

[20] [21]

jquick-curl: An antlr-based curl command parser for java,

H. Pao, “jquick-curl: An antlr-based curl command parser for java,” https://github.com/paohaijiao/jquick-curl, 2025, accessed March 2026

2025

[21] [22]

Chatunitest: A framework for llm-based test generation,

Y . Chen, Z. Hu, C. Zhi, J. Han, S. Deng, and J. Yin, “Chatunitest: A framework for llm-based test generation,” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 572–576

2024

[22] [23]

64 Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma, and Mohammad Masudur Rahman

A. Deljouyi, R. Koohestani, M. Izadi, and A. Zaidman, “Leveraging large language models for enhancing the understandability of generated unit tests,” inProceedings of the IEEE/ACM 47th International Conference on Software Engineering, ser. ICSE ’25. IEEE Press, 2025, p. 1449–1461. [Online]. Available: https://doi.org/10.1109/ ICSE55347.2025.00032

work page arXiv 2025

[23] [24]

Code-aware prompting: A study of coverage-guided test generation in regression setting using llm,

G. Ryan, S. Jain, M. Shang, S. Wang, X. Ma, M. K. Ra- manathan, and B. Ray, “Code-aware prompting: A study of coverage-guided test generation in regression setting using llm,”Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 951–971, 2024

2024

[24] [25]

PIT: A practical mutation testing tool for Java (demo),

H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ven- tresque, “PIT: A practical mutation testing tool for Java (demo),” inProceedings of the 25th International Sympo- sium on Software Testing and Analysis (ISSTA). ACM, 2016, pp. 449–452

2016

[25] [26]

A practical guide for using statistical tests to assess randomized algorithms in software engineering,

A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” inProceedings of the 33rd international con- ference on software engineering, 2011, pp. 1–10

2011

[26] [27]

A critique and improvement of the cl common language effect size statistics of mcgraw and wong,

A. Vargha and H. D. Delaney, “A critique and improvement of the cl common language effect size statistics of mcgraw and wong,”Journal of Educational and Behavioral Statis- tics, vol. 25, no. 2, pp. 101–132, 2000

2000

[27] [28]

Langchain official website,

“Langchain official website,” Accessed: 2025. [Online]. Available: https://www.langchain.com/

2025

[28] [29]

Qwen3-Coder: Agentic coding models,

Qwen Team, “Qwen3-Coder: Agentic coding models,” https://qwenlm.github.io/blog/qwen3-coder/, 2025, accessed March 2026

2025

[29] [30]

Efficient memory management for large language model serving with pagedat- tention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedat- tention,” inProceedings of the 29th symposium on operating systems principles, 2023, pp. 611–626

2023

[30] [31]

Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation,

H. Qian, Z. Liu, P. Zhang, K. Mao, D. Lian, Z. Dou, and T. Huang, “Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation,” in Proceedings of the ACM on Web Conference 2025, 2025, pp. 2366–2377

2025

[31] [32]

EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair

F. Mu, J. Wang, L. Shi, S. Wang, S. Li, and Q. Wang, “Experepair: Dual-memory enhanced llm-based repository- level program repair,”arXiv preprint arXiv:2506.10484, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [33]

Citrus: Automated unit testing tool for real-world c++ programs,

R. S. Herlim, Y . Kim, and M. Kim, “Citrus: Automated unit testing tool for real-world c++ programs,” in2022 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE Computer Society, 2022, pp. 400–410

2022

[33] [34]

Pynguin: Automated unit test generation for python,

S. Lukasczyk and G. Fraser, “Pynguin: Automated unit test generation for python,” inProceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, 2022, pp. 168–172

2022

[34] [35]

Test generation for embedded executables via concolic execution in a real environment,

T. Chen, X.-S. Zhang, X.-L. Ji, C. Zhu, Y . Bai, and Y . Wu, “Test generation for embedded executables via concolic execution in a real environment,”IEEE Transactions on Reliability, vol. 64, no. 1, pp. 284–296, 2014

2014

[35] [36]

Feedback-directed unit test generation for c/c++ using concolic execution,

P. Garg, F. Ivan ˇci´c, G. Balakrishnan, N. Maeda, and A. Gupta, “Feedback-directed unit test generation for c/c++ using concolic execution,” in2013 35th International Con- ference on Software Engineering (ICSE). IEEE, 2013, pp. 132–141

2013

[36] [37]

Large language models for c test case generation: A comparative analysis,

A. Guzu, G. Nicolae, H. Cucu, and C. Burileanu, “Large language models for c test case generation: A comparative analysis,”Electronics, vol. 14, no. 11, p. 2284, 2025

2025

[37] [38]

Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications,

N. Huynh and B. Lin, “Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications,”arXiv preprint arXiv:2503.01245, 2025

work page arXiv 2025

[38] [39]

Aster: Natural and multi-language unit test generation with llms,

R. Pan, M. Kim, R. Krishna, R. Pavuluri, and S. Sinha, “Aster: Natural and multi-language unit test generation with llms,” in2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 16 (ICSE-SEIP). IEEE, 2025, pp. 413–424

2020

[39] [40]

Citywalk: Enhancing llm-based c++ unit test generation via project-dependency awareness and language-specific knowledge,

Y . Zhang, Q. Lu, K. Liu, W. Dou, J. Zhu, L. Qian, C. Zhang, Z. Lin, and J. Wei, “Citywalk: Enhancing llm-based c++ unit test generation via project-dependency awareness and language-specific knowledge,”ACM Transactions on Soft- ware Engineering and Methodology, 2025

2025

[40] [41]

Cubetesterai: Automated junit test generation using the llama model,

D. Gorla, S. Kumar, P. N. R. Lorenzini, and A. Alipourfaz, “Cubetesterai: Automated junit test generation using the llama model,” in2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2025, pp. 565– 576

2025

[41] [42]

Static program analysis guided llm based unit test genera- tion,

S. Roy Chowdhury, G. Sridhara, A. Raghavan, J. Bose, S. Mazumdar, H. Singh, S. B. Sugumaran, and R. Britto, “Static program analysis guided llm based unit test genera- tion,” inProceedings of the 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD), 2024, pp. 279–283

2024