
arxiv: 2604.06946 · v1 · submitted 2026-04-08 · 💻 cs.SE · cs.AI

An empirical study of LoRA-based fine-tuning of large language models for automated test case generation

Pith reviewed 2026-05-10 17:49 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LoRA · fine-tuning · test case generation · large language models · automated testing · software requirements · parameter-efficient adaptation · open-source LLMs

The pith

Fine-tuning an 8B open-source LLM with LoRA produces test cases from requirements whose quality scores are comparable to those of GPT-4.1 models that have not been fine-tuned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper empirically compares LoRA fine-tuning across several large language models on the task of turning natural language requirements into executable test cases. It introduces a GPT-4o-based automated scorer that rates outputs on nine quality dimensions and reports that LoRA adaptation improves every open-source model tested, with the 8B model Ministral-8B reaching scores comparable to those of GPT-4.1 models before fine-tuning. GPT-4.1 models still achieve the highest overall scores, but the work indicates that parameter-efficient fine-tuning can narrow the practical gap between locally deployable open models and closed, higher-cost alternatives for this software engineering task.

Core claim

LoRA-based fine-tuning of open-source LLMs, particularly Ministral-8B, yields test cases whose quality scores under the nine-dimension GPT-4o framework become comparable to those produced by pre-fine-tuned GPT-4.1 models, while narrowing the performance difference between proprietary and open-source families after adaptation.

What carries the argument

LoRA (Low-Rank Adaptation) applied to requirement-to-test-case generation, together with a nine-dimension automated quality evaluator powered by GPT-4o.
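
For readers unfamiliar with the mechanism, the following is a minimal pure-Python sketch of a single LoRA-adapted linear layer, following the standard formulation in which a frozen weight W is augmented by a scaled low-rank product, (alpha / r) * B A. The function names (`matmul`, `transpose`, `lora_forward`) and the tiny matrices are illustrative assumptions, not the paper's training code; the paper's actual ranks, scaling factors, and dropout rates are not stated on this page.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def lora_forward(x, w, a, b, alpha, dropout=0.0, training=False, rng=None):
    """Compute x W^T + (alpha / r) * dropout(x) A^T B^T for a batch x.

    w: frozen pretrained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r), usually initialised to zero
    """
    r = len(a)               # LoRA rank
    scale = alpha / r        # LoRA scaling factor
    x_adapt = x
    if training and dropout > 0.0:
        # Inverted dropout on the adapter input only (assumed placement).
        rng = rng or random.Random(0)
        keep = 1.0 - dropout
        x_adapt = [[v / keep if rng.random() < keep else 0.0 for v in row] for row in x]
    base = matmul(x, transpose(w))
    low_rank = matmul(matmul(x_adapt, transpose(a)), transpose(b))
    return [[bv + scale * lv for bv, lv in zip(br, lr)] for br, lr in zip(base, low_rank)]
```

With `b` set to zeros the layer reproduces the frozen model exactly, which is why training can start from the pretrained behaviour; only `a` and `b` are updated during fine-tuning.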

If this is right

  • LoRA fine-tuning raises performance for every open-source model examined.
  • Ministral-8B records the highest scores among the fine-tuned open-source models.
  • The gap in test-case quality between proprietary and open-source models shrinks substantially once LoRA adaptation is applied.
  • Systematic variation of LoRA rank, scaling factor, and dropout changes downstream test generation quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LoRA recipe could be applied to other requirement-driven software engineering tasks such as code summarization or bug localization.
  • Organizations could replace API calls to proprietary models with locally hosted 8B models for routine test generation without large quality loss.
  • The nine-dimension evaluation template itself may be reusable for assessing LLM outputs on related structured generation problems.

Load-bearing premise

The GPT-4o automated scorer gives unbiased and reliable ratings of test case quality on the nine dimensions.

What would settle it

A side-by-side human expert rating of the same generated test cases that shows large, consistent disagreement with the GPT-4o dimension scores.

Original abstract

Automated test case generation from natural language requirements remains a challenging problem in software engineering due to the ambiguity of requirements and the need to produce structured, executable test artifacts. Recent advances in LLMs have shown promise in addressing this task; however, their effectiveness depends on task-specific adaptation and efficient fine-tuning strategies. In this paper, we present a comprehensive empirical study on the use of parameter-efficient fine-tuning, specifically LoRA, for requirement-based test case generation. We evaluate multiple LLM families, including open-source and proprietary models, under a unified experimental pipeline. The study systematically explores the impact of key LoRA hyperparameters, including rank, scaling factor, and dropout, on downstream performance. We propose an automated evaluation framework based on GPT-4o, which assesses generated test cases across nine quality dimensions. Experimental results demonstrate that LoRA-based fine-tuning significantly improves the performance of all open-source models, with Ministral-8B achieving the best results among them. Furthermore, we show that a fine-tuned 8B open-source model can achieve performance comparable to pre-fine-tuned GPT-4.1 models, highlighting the effectiveness of parameter-efficient adaptation. While GPT-4.1 models achieve the highest overall performance, the performance gap between proprietary and open-source models is substantially reduced after fine-tuning. These findings provide important insights into model selection, fine-tuning strategies, and evaluation methods for automated test generation. In particular, they demonstrate that cost-efficient, locally deployable open-source models can serve as viable alternatives to proprietary systems when combined with well-designed fine-tuning approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical study on LoRA-based fine-tuning of LLMs for automated test case generation from natural language requirements. It evaluates multiple open-source and proprietary model families under a unified pipeline, systematically varies LoRA hyperparameters (rank, scaling factor, dropout), and introduces a GPT-4o automated evaluator that scores outputs across nine quality dimensions. Central claims are that LoRA fine-tuning substantially improves open-source models (with Ministral-8B best among them), that a fine-tuned 8B open-source model reaches performance comparable to pre-fine-tuned GPT-4.1, and that fine-tuning substantially narrows the gap between proprietary and open-source models.

Significance. If the results hold after addressing evaluation validation, the work would be significant for software engineering by demonstrating that parameter-efficient fine-tuning can make smaller, locally deployable open-source LLMs competitive with proprietary systems for a practical task like test generation. The unified experimental pipeline and hyperparameter exploration are strengths that support reproducibility and practical guidance.

major comments (3)
  1. [Abstract and Evaluation Framework] Abstract and Evaluation Framework section: The central claim that a fine-tuned 8B model achieves performance comparable to GPT-4.1 rests entirely on GPT-4o scores across nine dimensions. No calibration against human judgments, correlation study, or inter-annotator agreement is reported, and the judge belongs to the same model family as the GPT-4.1 baseline. This is load-bearing for the comparability result and risks systematic bias favoring proprietary outputs.
  2. [Results] Results section: Performance comparisons and claims of significant improvement and comparability do not report statistical tests (e.g., significance levels, effect sizes, or confidence intervals) or details on dataset size, train/test split, or number of requirements evaluated. This limits assessment of whether observed differences are robust.
  3. [Methodology] Methodology section: The unified pipeline description lacks explicit baselines such as zero-shot prompting or full fine-tuning for the open-source models, making it harder to isolate the specific contribution of LoRA adaptation to the reported gains.
minor comments (2)
  1. [Abstract and Introduction] State which GPT-4.1 variants are evaluated (the abstract refers to 'GPT-4.1 models' in the plural), define what 'pre-fine-tuned' means for them, and ensure consistent model naming throughout.
  2. [Results] Tables reporting nine-dimension scores should include per-dimension breakdowns or aggregate statistics to allow readers to identify which quality aspects drive the overall comparability.
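
To make major comment 2 concrete, here is a minimal sketch of a paired effect size and a percentile bootstrap confidence interval for the mean score change on the same requirements before and after fine-tuning. It uses only the Python standard library; the function names and the choice of the paired (d_z) variant of Cohen's d are assumptions for illustration, and no scores from the paper are used. A paired significance test such as `scipy.stats.wilcoxon` could be run on the same two score lists.

```python
import random
import statistics


def paired_cohens_d(before, after):
    """Paired effect size: mean of the differences over their sample std dev."""
    diffs = [a - b for a, b in zip(after, before)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def bootstrap_ci(before, after, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(after, before)]
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((1 - level) / 2 * n_resamples)]
    upper = means[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper
```

Resampling differences rather than raw scores keeps the pairing by requirement intact, which matters when some requirements are systematically harder than others.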

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment point by point below, outlining our responses and the revisions we will make.

  1. Referee: [Abstract and Evaluation Framework] Abstract and Evaluation Framework section: The central claim that a fine-tuned 8B model achieves performance comparable to GPT-4.1 rests entirely on GPT-4o scores across nine dimensions. No calibration against human judgments, correlation study, or inter-annotator agreement is reported, and the judge belongs to the same model family as the GPT-4.1 baseline. This is load-bearing for the comparability result and risks systematic bias favoring proprietary outputs.

    Authors: We acknowledge that the reliance on GPT-4o for automated scoring is central to the comparability claims and that the absence of human calibration introduces a risk of bias, particularly given the model family overlap with the GPT-4.1 baseline. To mitigate this, the revised manuscript will include a human evaluation study on a stratified sample of test cases (approximately 10% of the evaluation set). We will report Pearson and Spearman correlations between GPT-4o scores and human ratings, along with inter-annotator agreement (Cohen's kappa) across the nine dimensions. This calibration will be presented in a new subsection of the Evaluation Framework. revision: yes

  2. Referee: [Results] Results section: Performance comparisons and claims of significant improvement and comparability do not report statistical tests (e.g., significance levels, effect sizes, or confidence intervals) or details on dataset size, train/test split, or number of requirements evaluated. This limits assessment of whether observed differences are robust.

    Authors: We agree that statistical tests and full experimental details are necessary to demonstrate robustness. The revised Results section will explicitly state the dataset size (number of requirements), the train/test split ratios, and the number of requirements used in evaluation. We will also add paired statistical tests (Wilcoxon signed-rank for non-normal distributions), effect sizes (Cohen's d), and 95% confidence intervals for all key comparisons between fine-tuned models and baselines. revision: yes

  3. Referee: [Methodology] Methodology section: The unified pipeline description lacks explicit baselines such as zero-shot prompting or full fine-tuning for the open-source models, making it harder to isolate the specific contribution of LoRA adaptation to the reported gains.

    Authors: We concur that explicit non-LoRA baselines would better isolate the contribution of parameter-efficient adaptation. The revised Methodology and Results sections will incorporate zero-shot prompting results for every open-source model under the same unified pipeline and prompt templates. For full fine-tuning, we will add a limited comparison using the smallest open-source model (where compute permits), while noting that full fine-tuning of larger models exceeds our available resources; this will be framed as a practical limitation rather than a comprehensive baseline. revision: partial
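
The calibration statistics named in response 1 can be sketched as follows. This is a minimal standard-library illustration, assuming scores are numeric per-test-case ratings and that kappa is computed on discrete rating labels from two annotators; the function names are hypothetical and nothing here reflects the authors' actual analysis code.

```python
from collections import Counter


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on categorical labels."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (observed - expected) / (1 - expected)
```

In the setting described, `spearman` would compare GPT-4o scores with human ratings for one dimension at a time, while `cohens_kappa` would be computed between human annotators.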

Circularity Check

0 steps flagged

No circularity in empirical study of LoRA fine-tuning


The paper reports direct experimental outcomes from fine-tuning open-source LLMs with LoRA on test case generation tasks and scoring outputs via a GPT-4o-based framework across nine dimensions. No derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described content. Results are presented as measured performance comparisons rather than quantities constructed from the paper's own inputs, rendering the work self-contained with no load-bearing steps that reduce by definition to prior elements within the same manuscript.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the assumption that the GPT-4o evaluator is a faithful proxy for human judgment of test-case quality and that the chosen requirements and test artifacts are representative of real software engineering practice.

free parameters (3)
  • LoRA rank
    Hyperparameter varied across experiments to optimize downstream test-generation quality.
  • LoRA scaling factor
    Hyperparameter explored as part of the systematic study of adaptation settings.
  • LoRA dropout
    Hyperparameter tuned to control regularization during fine-tuning.
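
A systematic sweep over these three free parameters can be sketched as a simple grid. The candidate values below are hypothetical placeholders, since the paper's actual search grid is not given on this page, and `lora_grid` is an illustrative helper rather than part of any library.

```python
from itertools import product

# Hypothetical search values; the paper's actual grid is not stated here.
RANKS = [8, 16, 32]
ALPHAS = [16, 32]
DROPOUTS = [0.0, 0.05, 0.1]


def lora_grid(ranks, alphas, dropouts):
    """Enumerate every (rank, alpha, dropout) combination as a config dict."""
    return [
        # "scale" is the effective multiplier alpha / rank applied to the update.
        {"rank": r, "alpha": a, "dropout": p, "scale": a / r}
        for r, a, p in product(ranks, alphas, dropouts)
    ]


configs = lora_grid(RANKS, ALPHAS, DROPOUTS)
```

Recording the effective scale alongside rank and alpha is a design choice for later analysis: two configurations with different ranks but the same alpha/rank ratio apply the low-rank update with the same magnitude multiplier.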
axioms (1)
  • domain assumption: GPT-4o can serve as an unbiased and accurate automated judge of test-case quality across nine dimensions.
    All reported performance differences and model comparisons depend on scores produced by this evaluator.

pith-pipeline@v0.9.0 · 5591 in / 1394 out tokens · 60099 ms · 2026-05-10T17:49:22.245786+00:00 · methodology


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 2 internal anchors

  1. [1]

    Software Testing Techniques: A Literature Review,

    M. A. Jamil, M. Arif, N. S. A. Abubakar, and A. Ahmad, "Software Testing Techniques: A Literature Review," in 2016 6th International Conference on Information and Communication Technology for The Muslim World (ICT4M), 2016, pp. 177-182

  2. [2]

    An orchestrated survey of methodologies for automated software test case generation,

    S. Anand, E. K. Burke, T. Y. Chen, J. Clark, M. B. Cohen, W. Grieskamp, et al., "An orchestrated survey of methodologies for automated software test case generation," Journal of Systems and Software, vol. 86, pp. 1978-2001, 2013

  3. [3]

    A3Test: Assertion-Augmented Automated Test case generation,

    S. Alagarsamy, C. Tantithamthavorn, and A. Aleti, "A3Test: Assertion-Augmented Automated Test case generation," Information and Software Technology, vol. 176, p. 107565, 2024

  4. [4]

    A Review of Large Language Models for Automated Test Case Generation,

    A. Celik and Q. H. Mahmoud, "A Review of Large Language Models for Automated Test Case Generation," Machine Learning and Knowledge Extraction, vol. 7, p. 97, 2025

  5. [5]

    Automated test case generation from requirements: A systematic literature review,

    A. Mustafa, W. M. Wan-Kadir, N. Ibrahim, M. A. Shah, M. Younas, A. Khan, et al., "Automated test case generation from requirements: A systematic literature review," Computers, Materials and Continua, vol. 67, pp. 1819-1833, 2021

  6. [6]

    Current Trends in Automated Test Case Generation,

    T. Potuzak and R. Lipka, "Current Trends in Automated Test Case Generation," in 2023 18th Conference on Computer Science and Intelligence Systems (FedCSIS), 2023, pp. 627-636

  7. [7]

    Rule-based generation of requirements traceability relations,

    G. Spanoudakis, A. Zisman, E. Pérez-Miñana, and P. Krause, "Rule-based generation of requirements traceability relations," Journal of Systems and Software, vol. 72, pp. 105-127, 2004

  8. [8]

    Automatic Generation of Acceptance Test Cases From Use Case Specifications: An NLP-Based Approach,

    C. Wang, F. Pastore, A. Goknil, and L. C. Briand, "Automatic Generation of Acceptance Test Cases From Use Case Specifications: An NLP-Based Approach," IEEE Transactions on Software Engineering, vol. 48, pp. 585-616, 2022

  9. [9]

    Reinforcement-Learning-Based Test Program Generation for Software-Based Self-Test,

    C. Y. Chen and J. L. Huang, "Reinforcement-Learning-Based Test Program Generation for Software-Based Self-Test," in 2019 IEEE 28th Asian Test Symposium (ATS), 2019, pp. 73-735

  10. [10]

    Taxonomy of Machine Learning Techniques in Test Case Generation,

    A. Singh, "Taxonomy of Machine Learning Techniques in Test Case Generation," in 2023 7th International Conference on Intelligent Computing and Control Systems (ICICCS), 2023, pp. 474-481

  11. [11]

    Large Language Models for Software Engineering: Survey and Open Problems,

    A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, et al., "Large Language Models for Software Engineering: Survey and Open Problems," in 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), 2023, pp. 31-53

  12. [12]

    Large Language Models for Software Engineering: A Systematic Literature Review,

    X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, et al., "Large Language Models for Software Engineering: A Systematic Literature Review," ACM Trans. Softw. Eng. Methodol., vol. 33, p. Article 220, 2024

  13. [13]

    TestEval: Benchmarking Large Language Models for Test Case Generation,

    W. Wang, C. Yang, Z. Wang, Y. Huang, Z. Chu, D. Song, et al., "TestEval: Benchmarking Large Language Models for Test Case Generation," Albuquerque, New Mexico, 2025, pp. 3547-3562

  14. [14]

    An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation,

    M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, "An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation," IEEE Transactions on Software Engineering, vol. 50, pp. 85-105, 2024

  15. [15]

    Effective test generation using pre-trained Large Language Models and mutation testing,

    A. M. Dakhel, A. Nikanjam, V. Majdinasab, F. Khomh, and M. C. Desmarais, "Effective test generation using pre-trained Large Language Models and mutation testing," Information and Software Technology, vol. 171, p. 107468, 2024

  16. [16]

    Automated Test Cases Generation From Requirements Specification,

    M. Lafi, T. Alrawashed, and A. M. Hammad, "Automated Test Cases Generation From Requirements Specification," in 2021 International Conference on Information Technology (ICIT), 2021, pp. 852-857

  17. [17]

    Automatic test case generation using natural language processing: A systematic mapping study,

    J. Navarro and R. Ibarra, "Automatic test case generation using natural language processing: A systematic mapping study," Information and Software Technology, vol. 189, p. 107929, 2026

  18. [18]

    Automated Test Case Generation From Natural Language Requirements Using Natural Language Processing,

    F. Arooj, H. Alishba, and R. Summair, "Automated Test Case Generation From Natural Language Requirements Using Natural Language Processing," Journal of Computing & Biomedical Informatics, vol. 9, 09/01 2025

  19. [19]

    Software test case generation using natural language processing (NLP): a systematic literature review,

    H. Ayenew and M. Wagaw, "Software test case generation using natural language processing (NLP): a systematic literature review," Artificial Intelligence Evolution, pp. 1-10, 2024

  20. [20]

    Understanding the Performance and Estimating the Cost of LLM Fine-Tuning,

    Y. Xia, J. Kim, Y. Chen, H. Ye, S. Kundu, C. C. Hao, et al., "Understanding the Performance and Estimating the Cost of LLM Fine-Tuning," in 2024 IEEE International Symposium on Workload Characterization (IISWC), 2024, pp. 210-223

  21. [21]

    Parameter-efficient fine-tuning of large-scale pre-trained language models,

    N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, et al., "Parameter-efficient fine-tuning of large-scale pre-trained language models," Nature Machine Intelligence, vol. 5, pp. 220-235, 2023

  22. [22]

    Requirements-based test generation: A comprehensive survey,

    Z. Yang, R. Huang, C. Cui, N. Niu, and D. Towey, "Requirements-based test generation: A comprehensive survey," ACM Transactions on Software Engineering and Methodology, 2025

  23. [23]

    Enhancing large language models for text-to-testcase generation,

    S. Alagarsamy, C. Tantithamthavorn, W. Takerngsaksiri, C. Arora, and A. Aleti, "Enhancing large language models for text-to-testcase generation," Journal of Systems and Software, vol. 230, p. 112531, 2025

  24. [24]

    Requirement-based automated black-box test generation,

    L. H. Tahat, B. Vaysburg, B. Korel, and A. J. Bader, "Requirement-based automated black-box test generation," in 25th Annual International Computer Software and Applications Conference. COMPSAC 2001, 2001, pp. 489-495

  25. [25]

    Towards a systematic requirement-based test generation framework: Industrial challenges and needs,

    S. Hesari, R. Behjati, and T. Yue, "Towards a systematic requirement-based test generation framework: Industrial challenges and needs," in 2013 21st IEEE International Requirements Engineering Conference (RE), 2013, pp. 261-266

  26. [26]

    Generation of Test Cases from Software Requirements Using Natural Language Processing,

    R. P. Verma and M. R. Beg, "Generation of Test Cases from Software Requirements Using Natural Language Processing," in 2013 6th International Conference on Emerging Trends in Engineering and Technology, 2013, pp. 140-147

  27. [27]

    Coverage-Directed Test Generation Automated by Machine Learning -- A Review,

    C. Ioannides and K. I. Eder, "Coverage-Directed Test Generation Automated by Machine Learning -- A Review," ACM Trans. Des. Autom. Electron. Syst., vol. 17, p. Article 7, 2012

  28. [28]

    Preparation method in automated test case generation using machine learning,

    K. Kikuma, T. Yamada, K. Sato, and K. Ueda, "Preparation method in automated test case generation using machine learning," in Proceedings of the 10th International Symposium on Information and Communication Technology, 2019, pp. 393-398

  29. [29]

    Machine Learning Techniques for Automated Test Case Generation and Optimization in Software Quality Assurance,

    N. Venkata Siva Prakash, "Machine Learning Techniques for Automated Test Case Generation and Optimization in Software Quality Assurance," Artificial Intelligence, Machine Learning, and Autonomous Systems, vol. 4, pp. 289-327, 2020

  30. [30]

    Software Testing With Large Language Models: Survey, Landscape, and Vision,

    J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, "Software Testing With Large Language Models: Survey, Landscape, and Vision," IEEE Transactions on Software Engineering, vol. 50, pp. 911-936, 2024

  31. [31]

    Towards an understanding of large language models in software engineering tasks,

    Z. Zheng, K. Ning, Q. Zhong, J. Chen, W. Chen, L. Guo, et al., "Towards an understanding of large language models in software engineering tasks," Empirical Software Engineering, vol. 30, p. 50, 2024

  32. [32]

    Evaluating large language models for software testing,

    Y. Li, P. Liu, H. Wang, J. Chu, and W. E. Wong, "Evaluating large language models for software testing," Computer Standards & Interfaces, vol. 93, p. 103942, 2025

  33. [33]

    A Survey on Large Language Models for Code Generation,

    J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, "A Survey on Large Language Models for Code Generation," ACM Trans. Softw. Eng. Methodol., vol. 35, p. Article 58, 2026

  34. [34]

    Evaluating Large Language Models in Class-Level Code Generation,

    X. Du, M. Liu, K. Wang, H. Wang, J. Liu, Y. Chen, et al., "Evaluating Large Language Models in Class-Level Code Generation," presented at the Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 2024

  35. [35]

    Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks,

    X. Hu, F. Niu, J. Chen, X. Zhou, J. Zhang, J. He, et al., "Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks," ACM Trans. Softw. Eng. Methodol., 2026

  36. [36]

    Challenges in applying large language models to requirements engineering tasks,

    J. J. Norheim, E. Rebentisch, D. Xiao, L. Draeger, A. Kerbrat, and O. L. de Weck, "Challenges in applying large language models to requirements engineering tasks," Design Science, vol. 10, p. e16, 2024

  37. [37]

    Large language models (LLMs) for requirements engineering (RE): A systematic literature review,

    M. A. Zadenoori, J. Dąbrowski, W. Alhoshan, L. Zhao, and A. Ferrari, "Large language models (llms) for requirements engineering (re): A systematic literature review," arXiv preprint arXiv:2509.11446, 2025

  38. [38]

    Fine-Tuning Large Language Models for Specialized Use Cases,

    D. M. Anisuzzaman, J. G. Malins, P. A. Friedman, and Z. I. Attia, "Fine-Tuning Large Language Models for Specialized Use Cases," Mayo Clinic Proceedings: Digital Health, vol. 3, p. 100184, 2025

  39. [39]

    Unveiling the Generalization Power of Fine-Tuned Large Language Models,

    H. Yang, Y. Zhang, J. Xu, H. Lu, P.-A. Heng, and W. Lam, "Unveiling the Generalization Power of Fine-Tuned Large Language Models," Mexico City, Mexico, 2024, pp. 884-899

  40. [40]

    Memorization without overfitting: Analyzing the training dynamics of large language models,

    K. Tirumala, A. Markosyan, L. Zettlemoyer, and A. Aghajanyan, "Memorization without overfitting: Analyzing the training dynamics of large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 38274-38290, 2022

  41. [41]

    A Critical Review of Methods and Challenges in Large Language Models,

    M. Moradi, K. Yan, D. Colwell, M. Samwald, and R. Asgari, "A Critical Review of Methods and Challenges in Large Language Models," Computers, Materials and Continua, vol. 82, pp. 1681-1698, 2025

  42. [42]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, et al., "LoRA: Low-rank adaptation of large language models," ICLR, vol. 1, p. 3, 2022

  43. [43]

    Low-rank adaptation for foundation models: A comprehensive review,

    M. Yang, J. Chen, J. Tao, Y. Zhang, J. Liu, J. Zhang, et al., "Low-rank adaptation for foundation models: A comprehensive review," arXiv preprint arXiv:2501.00365, 2024

  44. [44]

    DeepSeek-V3 Technical Report

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, et al., "Deepseek-v3 technical report," arXiv preprint arXiv:2412.19437, 2024

  45. [45]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, et al., "Llama: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023

  46. [46]

    A survey of reinforcement learning from human feedback,

    T. Kaufmann, P. Weng, V. Bengs, and E. Hüllermeier, "A survey of reinforcement learning from human feedback," Transactions on Machine Learning Research, 2024

  47. [47]

    Mistral AI. (01/10/2025). Mistral documentation. Available: https://docs.mistral.ai/

  48. [48]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints,

    J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebron, and S. Sanghai, "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," Singapore, 2023, pp. 4895-4901

  49. [49]

    A Closer Look into Using Large Language Models for Automatic Evaluation,

    C.-H. Chiang and H.-y. Lee, "A Closer Look into Using Large Language Models for Automatic Evaluation," Singapore, 2023, pp. 8928-8942

  50. [50]

    Large Language Model-Powered Automated Assessment: A Systematic Review,

    E. Emirtekin, "Large Language Model-Powered Automated Assessment: A Systematic Review," Applied Sciences, vol. 15, p. 5683, 2025

  51. [51]

    Grade Like a Human: Rethinking Automated Assessment with Large Language Models,

    W. Xie, J. Niu, C. J. Xue, and N. Guan, "Grade Like a Human: Rethinking Automated Assessment with Large Language Models," presented at the Proceedings of the International Conference on Research in Adaptive and Convergent Systems, 2026

  52. [52]

    Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation,

    T. Vu, K. Krishna, S. Alzubi, C. Tar, M. Faruqui, and Y.-H. Sung, "Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation," Miami, Florida, USA, 2024, pp. 17086-17105

    Appendix A. The prompt used for generating test cases from test requirements. This prompt was employed during fine-tuning the LLMs, also during inference wh...

  53. [53]

    Requirements may involve templates, configuration fields, user permissions, workflows, or API interactions

    [System response or validation] - If API testing is involved, include: - HTTP method and endpoint - Payload and headers (if applicable) - Expected response and status code - Any conditions, edge cases, or validation rules Assume that the system includes both a web-based UI and RESTful APIs. Requirements may involve templates, configuration fields, use...

  54. [54]

    Focus on *overall intent* and key concepts, not literal wording

    Semantic Similarity: How closely the meaning of the model-generated test case matches the ground truth. Focus on *overall intent* and key concepts, not literal wording

  55. [55]

    Information Coverage: How completely the generated output includes all the key details, preconditions, actions, expected outcomes, and edge cases present in the ground truth.

  56. [56]

    Critical Content Match: Whether the model has preserved *must-have* elements (e.g., specific actions, roles, UI elements, data types) from the requirement or ground truth

  57. [57]

    Structural and Format Accuracy: Whether the output is well-structured and conforms to expected formatting: clear test case title and type; actionable, step-by-step instructions; coherent order of steps; consistent formatting

  58. [58]

    Higher scores mean minimal omissions

    Omission: Degree to which important information is *missing*. Higher scores mean minimal omissions

  59. [59]

    Higher scores mean fewer or no hallucinations

    Hallucination: Degree to which the model *invents irrelevant or unsupported content* not present in the requirement. Higher scores mean fewer or no hallucinations

  60. [60]

    Higher scores mean the test case is easy to understand and unambiguous

    Ambiguity: Clarity and precision of language used. Higher scores mean the test case is easy to understand and unambiguous

  61. [61]

    Higher scores mean the output is concise and avoids redundant content

    Redundancy: Whether the test case contains unnecessary repetition. Higher scores mean the output is concise and avoids redundant content

  62. [62]

    semantic_similarity

    Diversity and Novelty: If applicable, does the generated test case introduce valid, logically sound variations or interpretations of the requirement that differ meaningfully from the ground truth?

    == INPUT ==
    ### Requirement
    - Name: {requirement_name}
    - Description: {requirement_description}
    ### Model-Generated Test Case
    {model_response}
    ### Ground Tru...
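
The nine dimensions listed above lend themselves to a per-dimension breakdown of the kind the referee report asks for. The following is a minimal sketch of that aggregation. Only the key `semantic_similarity` appears in the extracted prompt; the other eight key names, the equal weighting across dimensions, and the numeric score scale are assumptions made for illustration.

```python
from statistics import mean

# "semantic_similarity" comes from the extracted prompt; the remaining
# key names are hypothetical snake_case versions of the dimension titles.
DIMENSIONS = [
    "semantic_similarity",
    "information_coverage",
    "critical_content_match",
    "structural_format_accuracy",
    "omission",
    "hallucination",
    "ambiguity",
    "redundancy",
    "diversity_novelty",
]


def summarize(score_records):
    """Return (per-dimension means, equally weighted overall mean).

    score_records: list of dicts, one per evaluated test case, each mapping
    every dimension name to a numeric judge score.
    """
    per_dimension = {d: mean(rec[d] for rec in score_records) for d in DIMENSIONS}
    overall = mean(per_dimension.values())
    return per_dimension, overall
```

Reporting `per_dimension` next to `overall` shows whether a model's aggregate score is driven by, say, formatting accuracy rather than information coverage.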