An empirical study of LoRA-based fine-tuning of large language models for automated test case generation

David Colwell; Ke Yan; Milad Moradi; Rhona Asgari

arxiv: 2604.06946 · v1 · submitted 2026-04-08 · 💻 cs.SE · cs.AI

Milad Moradi, Ke Yan, David Colwell, Rhona Asgari

Pith reviewed 2026-05-10 17:49 UTC · model grok-4.3

keywords: LoRA, fine-tuning, test case generation, large language models, automated testing, software requirements, parameter-efficient adaptation, open-source LLMs

The pith

Fine-tuning an 8B open-source LLM with LoRA produces test cases from requirements whose quality is comparable to that of GPT-4.1 models before fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper empirically compares LoRA fine-tuning across several large language models on the task of turning natural language requirements into executable test cases. It introduces a GPT-4o-based automated scorer that rates outputs on nine quality dimensions and reports that LoRA adaptation improves every open-source model tested, with the fine-tuned 8B model (Ministral-8B) scoring comparably to GPT-4.1 models that have not been fine-tuned. The work therefore suggests that parameter-efficient fine-tuning can narrow the practical gap between locally deployable open models and closed, higher-cost alternatives for this software engineering task.

Core claim

LoRA-based fine-tuning of open-source LLMs, particularly Ministral-8B, yields test cases whose scores under the nine-dimension GPT-4o framework are comparable to those of GPT-4.1 models before fine-tuning, and it narrows the performance gap between the proprietary and open-source families.

What carries the argument

LoRA (Low-Rank Adaptation) applied to requirement-to-test-case generation, together with a nine-dimension automated quality evaluator powered by GPT-4o.
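
As a rough illustration of the LoRA mechanism named above, the sketch below adds a low-rank update to one frozen weight matrix using plain Python lists. It follows the standard LoRA formulation, y = W x + (alpha / rank) * B (A x), and is not the paper's training code. The function names, the toy matrices, and the choice to apply dropout to the adapter input (as common implementations do) are assumptions made for illustration.

```python
import random


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank, dropout=0.0, training=False, rng=None):
    """Compute W x + (alpha / rank) * B (A x) with W kept frozen.

    Dropout (an assumption about placement) is applied only to the
    input of the low-rank branch, and only when training is True.
    """
    rng = rng or random.Random(0)
    if training and dropout > 0.0:
        keep = 1.0 - dropout
        x_adapt = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    else:
        x_adapt = x
    base = matvec(W, x)                         # frozen pretrained path
    low_rank = matvec(B, matvec(A, x_adapt))    # trainable rank-r path
    scale = alpha / rank                        # the LoRA scaling factor
    return [b + scale * u for b, u in zip(base, low_rank)]


# Toy usage with hypothetical 2x2 weights and a rank-1 adapter.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
y = lora_forward(W, A, B, [1.0, 1.0], alpha=2.0, rank=1)
```

For a 4096 x 4096 projection, a rank-16 adapter trains 16 * (4096 + 4096) = 131,072 parameters instead of roughly 16.8 million, which is why rank is the main lever on adapter size.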

If this is right

LoRA fine-tuning raises performance for every open-source model examined.
Ministral-8B records the highest scores among the fine-tuned open-source models.
The gap in test-case quality between proprietary and open-source models shrinks substantially once LoRA adaptation is applied.
Systematic variation of LoRA rank, scaling factor, and dropout changes downstream test generation quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same LoRA recipe could be applied to other requirement-driven software engineering tasks such as code summarization or bug localization.
Organizations could replace API calls to proprietary models with locally hosted 8B models for routine test generation without large quality loss.
The nine-dimension evaluation template itself may be reusable for assessing LLM outputs on related structured generation problems.

Load-bearing premise

The GPT-4o automated scorer gives unbiased and reliable ratings of test case quality on the nine dimensions.

What would settle it

A side-by-side human expert rating of the same generated test cases that shows large, consistent disagreement with the GPT-4o dimension scores.
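
One way to quantify the agreement check described above is a rank correlation between the GPT-4o scores and human ratings of the same test cases. The sketch below is a minimal pure-Python Spearman correlation with average ranks for ties. The score lists are invented placeholders, not data from the paper, and the function names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Placeholder scores for five test cases (hypothetical values).
judge_scores = [4, 5, 3, 2, 4]
human_scores = [4, 4, 3, 1, 5]
rho = spearman(judge_scores, human_scores)
```

A correlation near 1 on every dimension would support the scorer; low or negative values on any dimension would be the kind of disagreement that undermines the comparability claim.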

Original abstract

Automated test case generation from natural language requirements remains a challenging problem in software engineering due to the ambiguity of requirements and the need to produce structured, executable test artifacts. Recent advances in LLMs have shown promise in addressing this task; however, their effectiveness depends on task-specific adaptation and efficient fine-tuning strategies. In this paper, we present a comprehensive empirical study on the use of parameter-efficient fine-tuning, specifically LoRA, for requirement-based test case generation. We evaluate multiple LLM families, including open-source and proprietary models, under a unified experimental pipeline. The study systematically explores the impact of key LoRA hyperparameters, including rank, scaling factor, and dropout, on downstream performance. We propose an automated evaluation framework based on GPT-4o, which assesses generated test cases across nine quality dimensions. Experimental results demonstrate that LoRA-based fine-tuning significantly improves the performance of all open-source models, with Ministral-8B achieving the best results among them. Furthermore, we show that a fine-tuned 8B open-source model can achieve performance comparable to pre-fine-tuned GPT-4.1 models, highlighting the effectiveness of parameter-efficient adaptation. While GPT-4.1 models achieve the highest overall performance, the performance gap between proprietary and open-source models is substantially reduced after fine-tuning. These findings provide important insights into model selection, fine-tuning strategies, and evaluation methods for automated test generation. In particular, they demonstrate that cost-efficient, locally deployable open-source models can serve as viable alternatives to proprietary systems when combined with well-designed fine-tuning approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical study on LoRA-based fine-tuning of LLMs for automated test case generation from natural language requirements. It evaluates multiple open-source and proprietary model families under a unified pipeline, systematically varies LoRA hyperparameters (rank, scaling factor, dropout), and introduces a GPT-4o automated evaluator that scores outputs across nine quality dimensions. Central claims are that LoRA fine-tuning substantially improves open-source models (with Ministral-8B best among them), that a fine-tuned 8B open-source model reaches performance comparable to pre-fine-tuned GPT-4.1, and that fine-tuning substantially narrows the gap between proprietary and open-source models.

Significance. If the results hold after addressing evaluation validation, the work would be significant for software engineering by demonstrating that parameter-efficient fine-tuning can make smaller, locally deployable open-source LLMs competitive with proprietary systems for a practical task like test generation. The unified experimental pipeline and hyperparameter exploration are strengths that support reproducibility and practical guidance.

major comments (3)

[Abstract and Evaluation Framework] The central claim that a fine-tuned 8B model achieves performance comparable to GPT-4.1 rests entirely on GPT-4o scores across nine dimensions. No calibration against human judgments, correlation study, or inter-annotator agreement is reported, and the judge belongs to the same model family as the GPT-4.1 baseline. This is load-bearing for the comparability result and risks systematic bias favoring proprietary outputs.
[Results] Performance comparisons and claims of significant improvement and comparability do not report statistical tests (e.g., significance levels, effect sizes, or confidence intervals) or details on dataset size, train/test split, or number of requirements evaluated. This limits assessment of whether observed differences are robust.
[Methodology] The unified pipeline description lacks explicit baselines such as zero-shot prompting or full fine-tuning for the open-source models, making it harder to isolate the specific contribution of LoRA adaptation to the reported gains.
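
The statistical reporting requested in the Results comment could take several forms. The sketch below shows one option under stated assumptions: a paired effect size (mean difference divided by the standard deviation of the differences, sometimes written d_z) and a percentile bootstrap confidence interval for the mean paired difference. The scores are placeholders, and the function names and resampling settings are illustrative rather than anything taken from the paper.

```python
import random
from statistics import mean, stdev


def paired_effect_size(before, after):
    """Paired-samples effect size: mean difference over SD of differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / stdev(diffs)


def bootstrap_ci_mean_diff(before, after, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    # Resample the paired differences with replacement and sort the means.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = int(((1 - level) / 2) * n_resamples)
    hi_idx = int((1 - (1 - level) / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-requirement scores before and after fine-tuning.
base_scores = [1.0, 2.0, 3.0, 4.0]
tuned_scores = [2.0, 3.0, 5.0, 6.0]
effect = paired_effect_size(base_scores, tuned_scores)
ci_low, ci_high = bootstrap_ci_mean_diff(base_scores, tuned_scores)
```

Pairing by requirement matters here: each requirement is scored under both conditions, so the differences, not the two raw score lists, are the unit of analysis.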

minor comments (2)

[Abstract and Introduction] Specify which GPT-4.1 variants were evaluated (the abstract refers to 'GPT-4.1 models' in the plural) and ensure consistent model naming throughout.
[Results] Tables reporting nine-dimension scores should include per-dimension breakdowns or aggregate statistics to allow readers to identify which quality aspects drive the overall comparability.
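
The per-dimension breakdown requested in the second minor comment can be sketched as a simple aggregation. The nine dimension names below follow the evaluation prompt excerpted on this page; apart from semantic_similarity, the snake_case keys are assumed spellings. The rating scale is not stated here, so the numbers are placeholders.

```python
from statistics import mean

# Nine quality dimensions from the evaluation prompt; key spellings
# other than "semantic_similarity" are assumptions.
DIMENSIONS = [
    "semantic_similarity",
    "information_coverage",
    "critical_content_match",
    "structural_and_format_accuracy",
    "omission",
    "hallucination",
    "ambiguity",
    "redundancy",
    "diversity_and_novelty",
]


def per_dimension_means(records):
    """Average each dimension over all scored test cases."""
    return {dim: mean(r[dim] for r in records) for dim in DIMENSIONS}


def overall_score(records):
    """Unweighted mean of the nine per-dimension averages."""
    return mean(per_dimension_means(records).values())


# Two placeholder score records (hypothetical values).
records = [
    {dim: 4 for dim in DIMENSIONS},
    {dim: 2 for dim in DIMENSIONS},
]
breakdown = per_dimension_means(records)
total = overall_score(records)
```

Reporting the breakdown alongside the aggregate shows whether comparability holds across all nine dimensions or is driven by a few of them.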

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment point by point below, outlining our responses and the revisions we will make.

Referee: [Abstract and Evaluation Framework] The central claim that a fine-tuned 8B model achieves performance comparable to GPT-4.1 rests entirely on GPT-4o scores across nine dimensions. No calibration against human judgments, correlation study, or inter-annotator agreement is reported, and the judge belongs to the same model family as the GPT-4.1 baseline. This is load-bearing for the comparability result and risks systematic bias favoring proprietary outputs.

Authors: We acknowledge that the reliance on GPT-4o for automated scoring is central to the comparability claims and that the absence of human calibration introduces a risk of bias, particularly given the model family overlap with the GPT-4.1 baseline. To mitigate this, the revised manuscript will include a human evaluation study on a stratified sample of test cases (approximately 10% of the evaluation set). We will report Pearson and Spearman correlations between GPT-4o scores and human ratings, along with inter-annotator agreement (Cohen's kappa) across the nine dimensions. This calibration will be presented in a new subsection of the Evaluation Framework. revision: yes
Referee: [Results] Performance comparisons and claims of significant improvement and comparability do not report statistical tests (e.g., significance levels, effect sizes, or confidence intervals) or details on dataset size, train/test split, or number of requirements evaluated. This limits assessment of whether observed differences are robust.

Authors: We agree that statistical tests and full experimental details are necessary to demonstrate robustness. The revised Results section will explicitly state the dataset size (number of requirements), the train/test split ratios, and the number of requirements used in evaluation. We will also add paired statistical tests (Wilcoxon signed-rank for non-normal distributions), effect sizes (Cohen's d), and 95% confidence intervals for all key comparisons between fine-tuned models and baselines. revision: yes
Referee: [Methodology] The unified pipeline description lacks explicit baselines such as zero-shot prompting or full fine-tuning for the open-source models, making it harder to isolate the specific contribution of LoRA adaptation to the reported gains.

Authors: We concur that explicit non-LoRA baselines would better isolate the contribution of parameter-efficient adaptation. The revised Methodology and Results sections will incorporate zero-shot prompting results for every open-source model under the same unified pipeline and prompt templates. For full fine-tuning, we will add a limited comparison using the smallest open-source model (where compute permits), while noting that full fine-tuning of larger models exceeds our available resources; this will be framed as a practical limitation rather than a comprehensive baseline. revision: partial
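
The inter-annotator agreement statistic named in the first response, Cohen's kappa, corrects raw agreement between two raters for the agreement expected by chance. The sketch below is a minimal pure-Python version for two raters assigning categorical labels. The ratings are placeholders and the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of six test cases on one dimension.
annotator_1 = [3, 4, 4, 5, 2, 3]
annotator_2 = [3, 4, 5, 5, 2, 2]
kappa = cohens_kappa(annotator_1, annotator_2)
```

For ordinal scores, a weighted kappa that penalizes near misses less than large disagreements may be a better fit; the unweighted form above treats every mismatch equally.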

Circularity Check

0 steps flagged

No circularity in empirical study of LoRA fine-tuning

The paper reports direct experimental outcomes from fine-tuning open-source LLMs with LoRA on test case generation tasks and scoring outputs via a GPT-4o-based framework across nine dimensions. No derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described content. Results are presented as measured performance comparisons rather than quantities constructed from the paper's own inputs, rendering the work self-contained with no load-bearing steps that reduce by definition to prior elements within the same manuscript.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the assumption that the GPT-4o evaluator is a faithful proxy for human judgment of test-case quality and that the chosen requirements and test artifacts are representative of real software engineering practice.

free parameters (3)

LoRA rank
Hyperparameter varied across experiments to optimize downstream test-generation quality.
LoRA scaling factor
Hyperparameter explored as part of the systematic study of adaptation settings.
LoRA dropout
Hyperparameter tuned to control regularization during fine-tuning.
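
A systematic sweep over these three hyperparameters can be enumerated as a grid. The sketch below builds every combination and records the effective scale alpha / rank for each one. The candidate values are hypothetical and are not the settings used in the paper.

```python
from itertools import product


def lora_grid(ranks, alphas, dropouts):
    """Enumerate all (rank, alpha, dropout) combinations for a sweep."""
    return [
        {"rank": r, "alpha": a, "dropout": p, "scale": a / r}
        for r, a, p in product(ranks, alphas, dropouts)
    ]


# Hypothetical candidate values for illustration only.
configs = lora_grid(ranks=[8, 16], alphas=[16, 32], dropouts=[0.0, 0.1])
```

Recording the scale alongside rank and alpha makes it easier to see when two configurations differ in adapter capacity but apply the same effective scaling.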

axioms (1)

Domain assumption: GPT-4o can serve as an unbiased and accurate automated judge of test-case quality across nine dimensions.
All reported performance differences and model comparisons depend on scores produced by this evaluator.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 2 internal anchors

[1]

Software Testing Techniques: A Literature Review,

M. A. Jamil, M. Arif, N. S. A. Abubakar, and A. Ahmad, "Software Testing Techniques: A Literature Review," in 2016 6th International Conference on Information and Communication Technology for The Muslim World (ICT4M), 2016, pp. 177-182

work page 2016
[2]

An orchestrated survey of methodologies for automated software test case generation,

S. Anand, E. K. Burke, T. Y. Chen, J. Clark, M. B. Cohen, W. Grieskamp, et al., "An orchestrated survey of methodologies for automated software test case generation," Journal of Systems and Software, vol. 86, pp. 1978-2001, 2013

work page 2013
[3]

A3Test: Assertion-Augmented Automated Test case generation,

S. Alagarsamy, C. Tantithamthavorn, and A. Aleti, "A3Test: Assertion-Augmented Automated Test case generation," Information and Software Technology, vol. 176, p. 107565, 2024

work page 2024
[4]

A Review of Large Language Models for Automated Test Case Generation,

A. Celik and Q. H. Mahmoud, "A Review of Large Language Models for Automated Test Case Generation," Machine Learning and Knowledge Extraction, vol. 7, p. 97, 2025

work page 2025
[5]

Automated test case generation from requirements: A systematic literature review,

A. Mustafa, W. M. Wan-Kadir, N. Ibrahim, M. A. Shah, M. Younas, A. Khan, et al., "Automated test case generation from requirements: A systematic literature review," Computers, Materials and Continua, vol. 67, pp. 1819-1833, 2021

work page 2021
[6]

Current Trends in Automated Test Case Generation,

T. Potuzak and R. Lipka, "Current Trends in Automated Test Case Generation," in 2023 18th Conference on Computer Science and Intelligence Systems (FedCSIS), 2023, pp. 627-636

work page 2023
[7]

Rule-based generation of requirements traceability relations,

G. Spanoudakis, A. Zisman, E. Pérez-Miñana, and P. Krause, "Rule-based generation of requirements traceability relations," Journal of Systems and Software, vol. 72, pp. 105-127, 2004

work page 2004
[8]

Automatic Generation of Acceptance Test Cases From Use Case Specifications: An NLP-Based Approach,

C. Wang, F. Pastore, A. Goknil, and L. C. Briand, "Automatic Generation of Acceptance Test Cases From Use Case Specifications: An NLP-Based Approach," IEEE Transactions on Software Engineering, vol. 48, pp. 585-616, 2022

work page 2022
[9]

Reinforcement-Learning-Based Test Program Generation for Software-Based Self-Test,

C. Y. Chen and J. L. Huang, "Reinforcement-Learning-Based Test Program Generation for Software-Based Self-Test," in 2019 IEEE 28th Asian Test Symposium (ATS), 2019, pp. 73-735

work page 2019
[10]

Taxonomy of Machine Learning Techniques in Test Case Generation,

A. Singh, "Taxonomy of Machine Learning Techniques in Test Case Generation," in 2023 7th International Conference on Intelligent Computing and Control Systems (ICICCS), 2023, pp. 474-481

work page 2023
[11]

Large Language Models for Software Engineering: Survey and Open Problems,

A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, et al., "Large Language Models for Software Engineering: Survey and Open Problems," in 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), 2023, pp. 31-53

work page 2023
[12]

Large Language Models for Software Engineering: A Systematic Literature Review,

X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, et al., "Large Language Models for Software Engineering: A Systematic Literature Review," ACM Trans. Softw. Eng. Methodol., vol. 33, p. Article 220, 2024

work page 2024
[13]

TestEval: Benchmarking Large Language Models for Test Case Generation,

W. Wang, C. Yang, Z. Wang, Y. Huang, Z. Chu, D. Song, et al., "TestEval: Benchmarking Large Language Models for Test Case Generation," Albuquerque, New Mexico, 2025, pp. 3547-3562

work page 2025
[14]

An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation,

M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, "An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation," IEEE Transactions on Software Engineering, vol. 50, pp. 85-105, 2024

work page 2024
[15]

Effective test generation using pre-trained Large Language Models and mutation testing,

A. M. Dakhel, A. Nikanjam, V. Majdinasab, F. Khomh, and M. C. Desmarais, "Effective test generation using pre-trained Large Language Models and mutation testing," Information and Software Technology, vol. 171, p. 107468, 2024

work page 2024
[16]

Automated Test Cases Generation From Requirements Specification,

M. Lafi, T. Alrawashed, and A. M. Hammad, "Automated Test Cases Generation From Requirements Specification," in 2021 International Conference on Information Technology (ICIT), 2021, pp. 852-857

work page 2021
[17]

Automatic test case generation using natural language processing: A systematic mapping study,

J. Navarro and R. Ibarra, "Automatic test case generation using natural language processing: A systematic mapping study," Information and Software Technology, vol. 189, p. 107929, 2026

work page 2026
[18]

Automated Test Case Generation From Natural Language Requirements Using Natural Language Processing,

F. Arooj, H. Alishba, and R. Summair, "Automated Test Case Generation From Natural Language Requirements Using Natural Language Processing," Journal of Computing & Biomedical Informatics, vol. 9, 09/01 2025

work page 2025
[19]

Software test case generation using natural language processing (NLP): a systematic literature review,

H. Ayenew and M. Wagaw, "Software test case generation using natural language processing (NLP): a systematic literature review," Artificial Intelligence Evolution, pp. 1-10, 2024

work page 2024
[20]

Understanding the Performance and Estimating the Cost of LLM Fine-Tuning,

Y. Xia, J. Kim, Y. Chen, H. Ye, S. Kundu, C. C. Hao, et al., "Understanding the Performance and Estimating the Cost of LLM Fine-Tuning," in 2024 IEEE International Symposium on Workload Characterization (IISWC), 2024, pp. 210-223

work page 2024
[21]

Parameter-efficient fine-tuning of large-scale pre-trained language models,

N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, et al., "Parameter-efficient fine-tuning of large-scale pre-trained language models," Nature Machine Intelligence, vol. 5, pp. 220-235, 2023

work page 2023
[22]

Requirements-based test generation: A comprehensive survey,

Z. Yang, R. Huang, C. Cui, N. Niu, and D. Towey, "Requirements-based test generation: A comprehensive survey," ACM Transactions on Software Engineering and Methodology, 2025

work page 2025
[23]

Enhancing large language models for text-to-testcase generation,

S. Alagarsamy, C. Tantithamthavorn, W. Takerngsaksiri, C. Arora, and A. Aleti, "Enhancing large language models for text-to-testcase generation," Journal of Systems and Software, vol. 230, p. 112531, 2025

work page 2025
[24]

Requirement-based automated black-box test generation,

L. H. Tahat, B. Vaysburg, B. Korel, and A. J. Bader, "Requirement-based automated black-box test generation," in 25th Annual International Computer Software and Applications Conference. COMPSAC 2001, 2001, pp. 489-495

work page 2001
[25]

Towards a systematic requirement-based test generation framework: Industrial challenges and needs,

S. Hesari, R. Behjati, and T. Yue, "Towards a systematic requirement-based test generation framework: Industrial challenges and needs," in 2013 21st IEEE International Requirements Engineering Conference (RE), 2013, pp. 261-266

work page 2013
[26]

Generation of Test Cases from Software Requirements Using Natural Language Processing,

R. P. Verma and M. R. Beg, "Generation of Test Cases from Software Requirements Using Natural Language Processing," in 2013 6th International Conference on Emerging Trends in Engineering and Technology, 2013, pp. 140-147

work page 2013
[27]

Coverage-Directed Test Generation Automated by Machine Learning -- A Review,

C. Ioannides and K. I. Eder, "Coverage-Directed Test Generation Automated by Machine Learning -- A Review," ACM Trans. Des. Autom. Electron. Syst., vol. 17, p. Article 7, 2012

work page 2012
[28]

Preparation method in automated test case generation using machine learning,

K. Kikuma, T. Yamada, K. Sato, and K. Ueda, "Preparation method in automated test case generation using machine learning," in Proceedings of the 10th International Symposium on Information and Communication Technology, 2019, pp. 393-398

work page 2019
[29]

Machine Learning Techniques for Automated Test Case Generation and Optimization in Software Quality Assurance,

N. Venkata Siva Prakash, "Machine Learning Techniques for Automated Test Case Generation and Optimization in Software Quality Assurance," Artificial Intelligence, Machine Learning, and Autonomous Systems, vol. 4, pp. 289-327, 2020

work page 2020
[30]

Software Testing With Large Language Models: Survey, Landscape, and Vision,

J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, "Software Testing With Large Language Models: Survey, Landscape, and Vision," IEEE Transactions on Software Engineering, vol. 50, pp. 911-936, 2024

work page 2024
[31]

Towards an understanding of large language models in software engineering tasks,

Z. Zheng, K. Ning, Q. Zhong, J. Chen, W. Chen, L. Guo, et al., "Towards an understanding of large language models in software engineering tasks," Empirical Software Engineering, vol. 30, p. 50, 2024

work page 2024
[32]

Evaluating large language models for software testing,

Y. Li, P. Liu, H. Wang, J. Chu, and W. E. Wong, "Evaluating large language models for software testing," Computer Standards & Interfaces, vol. 93, p. 103942, 2025

work page 2025
[33]

A Survey on Large Language Models for Code Generation,

J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, "A Survey on Large Language Models for Code Generation," ACM Trans. Softw. Eng. Methodol., vol. 35, p. Article 58, 2026

work page 2026
[34]

Evaluating Large Language Models in Class-Level Code Generation,

X. Du, M. Liu, K. Wang, H. Wang, J. Liu, Y. Chen, et al., "Evaluating Large Language Models in Class-Level Code Generation," presented at the Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 2024

work page 2024
[35]

Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks,

X. Hu, F. Niu, J. Chen, X. Zhou, J. Zhang, J. He, et al., "Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks," ACM Trans. Softw. Eng. Methodol., 2026

work page 2026
[36]

Challenges in applying large language models to requirements engineering tasks,

J. J. Norheim, E. Rebentisch, D. Xiao, L. Draeger, A. Kerbrat, and O. L. de Weck, "Challenges in applying large language models to requirements engineering tasks," Design Science, vol. 10, p. e16, 2024

work page 2024
[37]

Large language models (llms) for requirements engineering (re): A systematic literature review,

M. A. Zadenoori, J. Dąbrowski, W. Alhoshan, L. Zhao, and A. Ferrari, "Large language models (llms) for requirements engineering (re): A systematic literature review," arXiv preprint arXiv:2509.11446, 2025

work page arXiv 2025
[38]

Fine-Tuning Large Language Models for Specialized Use Cases,

D. M. Anisuzzaman, J. G. Malins, P. A. Friedman, and Z. I. Attia, "Fine-Tuning Large Language Models for Specialized Use Cases," Mayo Clinic Proceedings: Digital Health, vol. 3, p. 100184, 2025

work page 2025
[39]

Unveiling the Generalization Power of Fine-Tuned Large Language Models,

H. Yang, Y. Zhang, J. Xu, H. Lu, P.-A. Heng, and W. Lam, "Unveiling the Generalization Power of Fine-Tuned Large Language Models," Mexico City, Mexico, 2024, pp. 884-899

work page 2024
[40]

Memorization without overfitting: Analyzing the training dynamics of large language models,

K. Tirumala, A. Markosyan, L. Zettlemoyer, and A. Aghajanyan, "Memorization without overfitting: Analyzing the training dynamics of large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 38274-38290, 2022

work page 2022
[41]

A Critical Review of Methods and Challenges in Large Language Models,

M. Moradi, K. Yan, D. Colwell, M. Samwald, and R. Asgari, "A Critical Review of Methods and Challenges in Large Language Models," Computers, Materials and Continua, vol. 82, pp. 1681-1698, 2025

work page 2025
[42]

Lora: Low-rank adaptation of large language models,

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, et al., "Lora: Low-rank adaptation of large language models," ICLR, vol. 1, p. 3, 2022

work page 2022
[43]

Low-rank adaptation for foundation models: A comprehensive review,

M. Yang, J. Chen, J. Tao, Y. Zhang, J. Liu, J. Zhang, et al., "Low-rank adaptation for foundation models: A comprehensive review," arXiv preprint arXiv:2501.00365, 2024

work page arXiv 2024
[44]

DeepSeek-V3 Technical Report

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, et al., "Deepseek-v3 technical report," arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, et al., "Llama: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

A survey of reinforcement learning from human feedback,

T. Kaufmann, P. Weng, V. Bengs, and E. Hüllermeier, "A survey of reinforcement learning from human feedback," Transactions on Machine Learning Research, 2024

work page 2024
[47]

M. AI. (01/10/2025). Mistral documentation. Available: https://docs.mistral.ai/

work page 2025
[48]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints,

J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebron, and S. Sanghai, "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," Singapore, 2023, pp. 4895-4901

work page 2023
[49]

A Closer Look into Using Large Language Models for Automatic Evaluation,

C.-H. Chiang and H.-y. Lee, "A Closer Look into Using Large Language Models for Automatic Evaluation," Singapore, 2023, pp. 8928-8942

work page 2023
[50]

Large Language Model-Powered Automated Assessment: A Systematic Review,

E. Emirtekin, "Large Language Model-Powered Automated Assessment: A Systematic Review," Applied Sciences, vol. 15, p. 5683, 2025

work page 2025
[51]

Grade Like a Human: Rethinking Automated Assessment with Large Language Models,

W. Xie, J. Niu, C. J. Xue, and N. Guan, "Grade Like a Human: Rethinking Automated Assessment with Large Language Models," presented at the Proceedings of the International Conference on Research in Adaptive and Convergent Systems, 2026

work page 2026
[52]

Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation,

T. Vu, K. Krishna, S. Alzubi, C. Tar, M. Faruqui, and Y.-H. Sung, "Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation," Miami, Florida, USA, 2024, pp. 17086-17105.

Appendix A The prompt used for generating test cases from test requirements. This prompt was employed during fine-tuning the LLMs, also during inference wh...

work page 2024
[53]

Requirements may involve templates, configuration fields, user permissions, workflows, or API interactions

[System response or validation] - If API testing is involved, include: - HTTP method and endpoint - Payload and headers (if applicable) - Expected response and status code - Any conditions, edge cases, or validation rules Assume that the system includes both a web-based UI and RESTful APIs. Requirements may involve templates, configuration fields, use...

work page
[54]

Focus on *overall intent* and key concepts, not literal wording

Semantic Similarity: How closely the meaning of the model-generated test case matches the ground truth. Focus on *overall intent* and key concepts, not literal wording

work page
[55]

Information Coverage: How completely the generated output includes all the key details, preconditions, actions, expected outcomes, and edge cases present in the ground truth.

work page
[56]

Critical Content Match: Whether the model has preserved *must-have* elements (e.g., specific actions, roles, UI elements, data types) from the requirement or ground truth

work page
[57]

Structural and Format Accuracy: Whether the output is well-structured and conforms to expected formatting: - Clear test case title and type - Actionable, step-by-step instructions - Coherent order of steps - Consistent formatting

work page
[58]

Higher scores mean minimal omissions

Omission: Degree to which important information is *missing*. Higher scores mean minimal omissions

work page
[59]

Higher scores mean fewer or no hallucinations

Hallucination: Degree to which the model *invents irrelevant or unsupported content* not present in the requirement. Higher scores mean fewer or no hallucinations

work page
[60]

Higher scores mean the test case is easy to understand and unambiguous

Ambiguity: Clarity and precision of language used. Higher scores mean the test case is easy to understand and unambiguous

work page
[61]

Higher scores mean the output is concise and avoids redundant content

Redundancy: Whether the test case contains unnecessary repetition. Higher scores mean the output is concise and avoids redundant content

work page
[62]

semantic_similarity

Diversity and Novelty: If applicable, does the generated test case introduce valid, logically sound variations or interpretations of the requirement that differ meaningfully from the ground truth? == INPUT == ### Requirement - Name: {requirement_name} - Description: {requirement_description} ### Model-Generated Test Case {model_response} ### Ground Tru...

work page

Appendix A

The prompt used for generating test cases from test requirements. This prompt was employed during fine-tuning the LLMs, also during inference wh...

[System response or validation]

- If API testing is involved, include:
- HTTP method and endpoint
- Payload and headers (if applicable)
- Expected response and status code
- Any conditions, edge cases, or validation rules

Assume that the system includes both a web-based UI and RESTful APIs. Requirements may involve templates, configuration fields, user permissions, workflows, or API interactions.

Semantic Similarity: How closely the meaning of the model-generated test case matches the ground truth. Focus on *overall intent* and key concepts, not literal wording.

Information Coverage: How completely the generated output includes all the key details, preconditions, actions, expected outcomes, and edge cases present in the ground truth.

Critical Content Match: Whether the model has preserved *must-have* elements (e.g., specific actions, roles, UI elements, data types) from the requirement or ground truth.

Structural and Format Accuracy: Whether the output is well-structured and conforms to expected formatting:
- Clear test case title and type
- Actionable, step-by-step instructions
- Coherent order of steps
- Consistent formatting

Omission: Degree to which important information is *missing*. Higher scores mean minimal omissions.

Hallucination: Degree to which the model *invents irrelevant or unsupported content* not present in the requirement. Higher scores mean fewer or no hallucinations.

Ambiguity: Clarity and precision of language used. Higher scores mean the test case is easy to understand and unambiguous.

Redundancy: Whether the test case contains unnecessary repetition. Higher scores mean the output is concise and avoids redundant content.

Diversity and Novelty: If applicable, does the generated test case introduce valid, logically sound variations or interpretations of the requirement that differ meaningfully from the ground truth?

== INPUT ==

### Requirement
- Name: {requirement_name}
- Description: {requirement_description}

### Model-Generated Test Case
{model_response}

### Ground Tru...
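As a minimal sketch of how the evaluation prompt might be used in practice, the Python below fills the visible placeholders of the INPUT section and checks a scorer's reply. It assumes the evaluator replies with a JSON object holding one numeric score per dimension. The snake_case key names, the helper names `build_input`, `parse_scores`, and `mean_score`, and the unweighted mean are all assumptions made for illustration, not the paper's implementation. The ground-truth section is truncated in the text and is deliberately left out.

```python
import json

# Hypothetical snake_case keys for the nine dimensions named in the prompt.
# These key names are assumptions made for this sketch.
DIMENSIONS = [
    "semantic_similarity",
    "information_coverage",
    "critical_content_match",
    "structural_format_accuracy",
    "omission",
    "hallucination",
    "ambiguity",
    "redundancy",
    "diversity_novelty",
]

# Only the part of the INPUT section that is visible in the text; the
# ground-truth section is truncated in the source and is omitted here.
INPUT_TEMPLATE = (
    "== INPUT ==\n"
    "### Requirement\n"
    "- Name: {requirement_name}\n"
    "- Description: {requirement_description}\n"
    "\n"
    "### Model-Generated Test Case\n"
    "{model_response}\n"
)


def build_input(requirement_name, requirement_description, model_response):
    """Fill the visible placeholders of the evaluation prompt."""
    return INPUT_TEMPLATE.format(
        requirement_name=requirement_name,
        requirement_description=requirement_description,
        model_response=model_response,
    )


def parse_scores(reply_text):
    """Parse a JSON reply and require a numeric score for every dimension."""
    data = json.loads(reply_text)
    missing = [d for d in DIMENSIONS if d not in data]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    scores = {}
    for d in DIMENSIONS:
        value = data[d]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric score for {d}: {value!r}")
        scores[d] = float(value)
    return scores


def mean_score(scores):
    """Unweighted mean over the nine dimensions (an illustrative aggregate)."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

Rejecting replies with missing or non-numeric dimensions keeps a malformed evaluator response from silently skewing an aggregate; whether the dimensions should be weighted equally is a separate design choice that this sketch does not settle.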