Test Design and Review Argumentation in AI-Assisted Test Generation

Eduard Paul Enoiu; Robert Feldt

arXiv: 2604.22473 · v1 · submitted 2026-04-24 · cs.SE


Pith reviewed 2026-05-08 11:23 UTC · model grok-4.3


keywords AI-assisted test generation · test design · test review · argumentation · software testing · test taxonomy · evidence-based testing


The pith

A taxonomy and template represent each AI-generated test case through its goal, claim, reason, and evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a structured way to capture the reasoning behind individual test cases produced with AI assistance. Each test is characterized by its goal, the claim it advances, the reason for its selection, and the evidence that supports it. This matters because current AI tools can create tests but leave engineers without a clear view of why those tests were chosen or what justifies them. The taxonomy supports both building tests with explicit arguments and reviewing them afterward by evaluating the quality of that argument. It moves attention away from whether a test looks plausible on its own toward whether its attached justification holds up.

Core claim

The central claim is that a conceptual taxonomy and structured template for AI-assisted test generation can characterize a test case by its test goal, claim, reason, and evidence. This structure is meant for use both when designing tests and when reviewing them later, allowing assessment of the quality of the attached argument rather than the plausibility or objective value of the generated test cases themselves.

What carries the argument

The argument is carried by the conceptual taxonomy and structured template, which decompose each test case into four elements: test goal, claim, reason, and evidence. Together they make the justification for each design decision explicit and inspectable.
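
The four elements can be pictured as a small record type. The following is only a sketch under stated assumptions: the class name `DesignArgument`, the `render` method, and the `parse_age` example are hypothetical and do not come from the paper, which, as summarized here, gives no concrete instance of the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignArgument:
    """Hypothetical record holding the four elements for one test case."""
    goal: str                      # the test goal
    claim: str                     # the claim the test advances
    reason: str                    # the reason the test was selected
    evidence: tuple[str, ...] = () # items offered in support of the claim

    def render(self) -> str:
        """Return the argument as a plain-text template for a reviewer."""
        evidence_text = "; ".join(self.evidence) if self.evidence else "(none)"
        return "\n".join([
            f"Goal:     {self.goal}",
            f"Claim:    {self.claim}",
            f"Reason:   {self.reason}",
            f"Evidence: {evidence_text}",
        ])


# Illustrative values only; parse_age() is an invented function under test.
example = DesignArgument(
    goal="Exercise the lower boundary of the parse_age() input range",
    claim="parse_age(-1) is rejected with a ValueError",
    reason="Boundary values are a common source of off-by-one defects",
    evidence=("Requirement text 'age must be >= 0' (hypothetical)",),
)
print(example.render())
```

Keeping the record immutable is a design choice for the sketch: an argument attached to a generated test stays fixed while it is under review.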

If this is right

AI test generators can be guided to produce not only the test but also its supporting goal, claim, reason, and evidence.
Reviewers can evaluate the strength of the justification instead of judging the test in isolation.
The same four-element breakdown can be applied both while constructing tests and while inspecting them after generation.
Assessment of test quality shifts focus from surface plausibility to the coherence of the attached argument.
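
A reviewer's first pass over such an argument could be partly mechanical. The sketch below assumes a hypothetical `DesignArgument` record and a hypothetical `review_findings` helper, neither of which is defined in the paper. It checks only that each element is present; whether the reason and evidence actually support the claim is the judgment the taxonomy leaves to the human reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignArgument:
    """Hypothetical record for the four elements of one test case."""
    goal: str
    claim: str
    reason: str
    evidence: tuple[str, ...] = ()


def review_findings(arg: DesignArgument) -> list[str]:
    """List structural gaps in an argument.

    Presence checks only: this does not assess how well the reason and
    evidence support the claim.
    """
    findings = []
    for name in ("goal", "claim", "reason"):
        # An element that is empty or whitespace-only counts as missing.
        if not getattr(arg, name).strip():
            findings.append(f"missing {name}")
    if not any(item.strip() for item in arg.evidence):
        findings.append("missing evidence")
    return findings


# A generated test that arrives with a goal and claim but no justification.
incomplete = DesignArgument(goal="Cover empty input", claim="Returns []", reason="")
print(review_findings(incomplete))
```

An empty result means only that the argument is complete enough to be reviewed, not that it is sound.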

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Requiring AI tools to output these structures by default could increase traceability in test suites without extra manual effort.
The same decomposition might apply to other AI-generated artifacts such as code or specifications where justification is important.
Controlled experiments measuring review time and defect detection rates with versus without the template would provide direct evidence of practical benefit.
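
If a tool were asked to emit the structure by default, the receiving side would need to reject output that omits an element. The sketch below assumes the tool returns a JSON object with the keys `goal`, `claim`, `reason`, and `evidence`. That schema and the `parse_argument` helper are assumptions made for illustration, not a format proposed in the paper or offered by any particular tool.

```python
import json

# Assumed key names, mirroring the four elements of the taxonomy.
REQUIRED_KEYS = ("goal", "claim", "reason", "evidence")


def parse_argument(raw: str) -> dict:
    """Parse a tool's JSON output and reject it if any element is absent."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError("missing elements: " + ", ".join(missing))
    # Keep only the four elements; any extra keys are dropped.
    return {key: data[key] for key in REQUIRED_KEYS}


raw_output = (
    '{"goal": "Cover empty input", "claim": "Returns an empty list", '
    '"reason": "Empty collections are a common edge case", '
    '"evidence": ["Docstring states empty input is allowed (hypothetical)"]}'
)
argument = parse_argument(raw_output)
print(argument["claim"])
```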

Load-bearing premise

Explicitly representing the argumentation behind individual test design decisions will enable engineers to better understand and assess the quality of AI-generated tests during both design and review.

What would settle it

A study in which engineers review identical sets of AI-generated tests with and without the goal-claim-reason-evidence structure attached, and which measures whether the structure produces differences in comprehension or quality assessment.

Figures

Figures reproduced from arXiv: 2604.22473 by Eduard Paul Enoiu, Robert Feldt.

**Figure 1.** Conceptual model of a test design and review argument.

Original abstract

AI assistants can increasingly generate and evolve test cases. The challenge is no longer merely to produce them, but also to help engineers understand why a generated artefact exists and what supports it. Existing work has focused on classifying testing techniques, linking requirements to tests and structuring system assurance arguments, but it does not explicitly represent the argumentation behind individual test design decisions. We propose a conceptual taxonomy and a structured template for AI-assisted test generation that characterizes a test case by its test goal, claim, reason, and evidence. The taxonomy is intended for both constructive use during test design and retrospective use during review, to assess the quality of the attached argument rather than the plausibility or objective value of the generated test cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper proposes a four-part template for documenting the rationale behind individual AI-generated tests but stays strictly at the conceptual level.


The main takeaway is a straightforward taxonomy that breaks each test case into goal, claim, reason, and evidence. The authors position this as a way to make the reasoning explicit during both creation and review of AI-produced tests, rather than just evaluating the test output itself. That focus on per-test argumentation is the piece that feels new compared to broader work on technique taxonomies or system assurance cases.

The proposal is motivated by a genuine issue in the field: AI test generators are getting better at producing artifacts, but engineers still need help understanding why a particular test was generated and what supports it. The template itself is simple and could be applied without new tooling. The soft spot is that the paper provides no examples of the template in use, no discussion of how it would work with current AI systems, and no evidence that it improves understanding or review quality. Everything rests on the framing that explicit arguments will help, which is plausible but untested here.

This is the sort of paper that would interest researchers working on explainable AI for software engineering or practitioners looking for lightweight ways to structure test documentation. It is not aimed at readers seeking empirical results or implemented solutions. I would send it to peer review. The core idea is clear and the gap it identifies is real, so referees could usefully push for concrete illustrations or a small pilot to move it forward.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes a conceptual taxonomy and structured template for AI-assisted test generation. It argues that existing work on classifying testing techniques, linking requirements to tests, and structuring system assurance arguments does not explicitly represent the argumentation behind individual test design decisions. The authors introduce a four-part characterization of each test case—test goal, claim, reason, and evidence—intended to support both constructive use during design and retrospective use during review, with the focus on assessing the quality of the attached argument rather than the objective value of the test.

Significance. If adopted, the taxonomy could provide a practical structuring device for improving transparency and reviewability of AI-generated tests, addressing a genuine gap between automated generation and human oversight in software engineering. The work explicitly identifies limitations in prior literature on testing classification and assurance arguments and offers a lightweight, reusable template as a response. As a purely conceptual contribution with no empirical data, examples, or validation, its significance will depend on community uptake and subsequent empirical studies.

minor comments (3)

The abstract states the taxonomy is for both design and review but provides no illustrative example of applying the four elements (goal, claim, reason, evidence) to a concrete test case; adding one short worked example would clarify the template's intended use.
The manuscript would benefit from explicit mapping or comparison of the proposed elements to at least one existing assurance-case notation (e.g., GSN or CAE) to demonstrate interoperability.
Terminology such as 'test goal' and 'claim' is introduced without a dedicated definitions subsection; a small table or glossary would improve precision and reduce potential reader ambiguity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful summary of our contribution and the recommendation for minor revision. The report accurately captures the scope and intent of the taxonomy as a lightweight structuring device for test design argumentation rather than an empirical evaluation of test quality.

Circularity Check

0 steps flagged

No circularity in conceptual taxonomy proposal


The paper advances a definitional taxonomy and template that decomposes each test case into test goal, claim, reason, and evidence for use in AI-assisted test generation and review. No equations, derivations, fitted parameters, predictions, or quantitative claims appear anywhere in the manuscript. The central contribution is presented explicitly as a structuring device motivated by gaps in existing literature on testing techniques and assurance arguments, without any reduction of the proposed structure to its own inputs, self-citations that bear the load of the claim, or renaming of prior results. The proposal remains self-contained as an independent conceptual framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper contributes a new conceptual structure motivated by a stated gap in prior literature; no free parameters or external data are involved.

axioms (1)

Domain assumption: Existing work has focused on classifying testing techniques, linking requirements to tests and structuring system assurance arguments, but it does not explicitly represent the argumentation behind individual test design decisions.
This gap statement in the abstract serves as the primary motivation and premise for the proposal.

invented entities (1)

Taxonomy and template with test goal, claim, reason, and evidence (no independent evidence)
Purpose: to characterize test cases for constructive use in design and retrospective use in review of AI-assisted test generation.
Newly defined structure introduced to address the identified gap; no independent evidence or prior reference provided in abstract.

pith-pipeline@v0.9.0 · 5408 in / 1253 out tokens · 26210 ms · 2026-05-08T11:23:28.735234+00:00 · methodology

