pith. sign in

arxiv: 2605.28321 · v1 · pith:BTWL6HNLnew · submitted 2026-05-27 · 💻 cs.SE · cs.AI

Multi-Agent LLM-based Metamorphic Testing for REST APIs

Pith reviewed 2026-06-29 10:50 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords metamorphic testingREST APIsLLM agentsmulti-agent workflowOpenAPItest oracle problemAPI testingsoftware quality
0
0 comments X

The pith

ARMeta uses a multi-agent LLM workflow to generate metamorphic tests that complement scenario-based testing for REST APIs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ARMeta, a tool that applies metamorphic testing to REST APIs specified in OpenAPI by using multiple LLM agents to create test scenarios. Metamorphic testing checks consistency between related outputs instead of requiring a known correct result for each call. The agents identify relations, express them in Given-When-Then format, turn them into runnable tests, and execute those tests on the target system. When run on two public web applications, the generated tests uncovered behaviors not reached by a standard scenario-based testing method. A reader would care because many REST APIs lack clear test oracles, making it hard to know whether an output is right.

Core claim

ARMeta is a tool-supported approach that uses an LLM-based multi-agent workflow to support metamorphic testing of REST APIs documented with OpenAPI. The agentic workflow is used to identify metamorphic test scenarios and specify them in the Given-When-Then format. These scenarios are automatically implemented as executable tests and executed against the system under test. We evaluate ARMeta on two publicly available web applications that expose REST interfaces and compare its performance with a scenario-based testing baseline. The results show that ARMeta explores behaviors that serve as a complement to existing scenario-based testing approaches.

What carries the argument

LLM-based multi-agent workflow that identifies metamorphic relations and encodes them as executable Given-When-Then tests for OpenAPI-described REST APIs.

If this is right

  • Metamorphic testing can be applied to REST APIs even when no explicit oracle exists for individual responses.
  • The generated tests run automatically once the scenarios are turned into code.
  • The method can be compared directly against scenario-based testing on any OpenAPI-documented service.
  • Additional behaviors become reachable without manual definition of every test case.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the agent workflow scales without extra human review, manual effort for writing metamorphic relations could drop for teams that already use OpenAPI specs.
  • The same multi-agent pattern might transfer to property-based testing or consistency checks in other API styles such as GraphQL.
  • Integration into CI pipelines would allow repeated runs to catch behavioral drift after code changes.

Load-bearing premise

The LLM agents reliably produce valid metamorphic relations that do not introduce false positives or miss critical behaviors when applied to real REST APIs.

What would settle it

An execution of ARMeta on one of the two evaluated applications in which a generated metamorphic relation produces a false positive or fails to flag a known fault in the API would show the approach does not reliably complement the baseline.

Figures

Figures reproduced from arXiv: 2605.28321 by Abdullah Mughees, Dragos Truscan, Gaadha Sudheerbabu, Shehroz Khan, Tanwir Ahmad.

Figure 1
Figure 1. Figure 1: Example of High-level MT scenario in GWT form (a) [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: High-level multi-agent architecture approach is iterative and incremental. A test generation session is composed of several iterations. In each iteration, a set of metamorphic tests is generated and executed. If the stopping criteria for test generation are not met, a subsequent iteration is initiated. Concretely, the Test Manager takes the REST API specifi￾cation and the test generation parameters as inpu… view at source ↗
Figure 3
Figure 3. Figure 3: ARMeta tool displaying the passed and failed scenarios. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
read the original abstract

As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and uncovering underlying issues are of utmost importance for improving software quality. However, testing REST APIs is challenging mainly due to the difficulty of assessing whether the output of an API call is correct, i.e., the test oracle problem. Metamorphic testing is a specification-based testing approach for situations where correct outputs are unknown or not specified explicitly. To check the correctness of a system, relations between the different outputs are specified. We present ARMeta, a tool-supported approach that uses an LLM-based multi-agent workflow to support metamorphic testing of REST APIs documented with OpenAPI. The agentic workflow is used to identify metamorphic test scenarios and specify them in the Given-When-Then format. These scenarios are automatically implemented as executable tests and executed against the system under test. We evaluate ARMeta on two publicly available web applications that expose REST interfaces and compare its performance with a scenario-based testing baseline. The results show that ARMeta explores behaviors that serve as a complement to existing scenario-based testing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents ARMeta, a multi-agent LLM workflow that generates metamorphic test scenarios in Given-When-Then format from OpenAPI specifications for REST APIs, automatically implements them as executable tests, and evaluates the approach on two public web applications against a scenario-based testing baseline, claiming that the generated behaviors serve as a complement to existing methods.

Significance. If the empirical results hold after proper validation of the metamorphic relations and fair baseline comparison, the work could advance automated testing for systems with difficult oracles by reducing the manual burden of specifying relations and demonstrating practical complementarity in REST API testing.

major comments (2)
  1. [Abstract] Abstract: the central claim of complementarity rests on an empirical comparison, but the abstract (and by extension the evaluation) provides no details on how the generated metamorphic relations were validated for correctness, what metrics quantified the explored behaviors, or how the scenario-based baseline was matched in scope and effort.
  2. [Evaluation] Evaluation (implied by abstract): without reported checks for false positives introduced by LLM-generated relations or coverage of critical behaviors missed by the baseline, the complementarity result cannot be assessed for reliability on real REST APIs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our empirical claims. We address each major comment below and commit to revisions that strengthen the presentation of our validation and comparison methods without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of complementarity rests on an empirical comparison, but the abstract (and by extension the evaluation) provides no details on how the generated metamorphic relations were validated for correctness, what metrics quantified the explored behaviors, or how the scenario-based baseline was matched in scope and effort.

    Authors: We agree that the abstract, as a concise summary, should better support the complementarity claim with high-level indicators of the underlying methods. The current evaluation describes the workflow and results on the two applications but does not explicitly detail validation steps, specific metrics, or baseline matching criteria in one place. In the revision we will update the abstract with a brief clause noting expert review for relation correctness, use of coverage and relation-count metrics, and scope matching via identical OpenAPI inputs and comparable scenario volumes; the evaluation section will be expanded with a dedicated paragraph consolidating these elements. revision: yes

  2. Referee: [Evaluation] Evaluation (implied by abstract): without reported checks for false positives introduced by LLM-generated relations or coverage of critical behaviors missed by the baseline, the complementarity result cannot be assessed for reliability on real REST APIs.

    Authors: We acknowledge that explicit reporting of false-positive mitigation and differential coverage analysis would improve assessability of the complementarity result. The manuscript currently reports aggregate outcomes and qualitative complementarity but does not include a systematic false-positive audit or enumeration of behaviors uniquely missed by the baseline. We will add these in revision by describing the manual filtering process used to discard invalid relations and by including a table or subsection contrasting unique state transitions and endpoint behaviors covered by ARMeta versus the baseline. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper describes an empirical tool (ARMeta) and its evaluation on two public REST applications against a scenario-based baseline. The central claim of complementarity rests on observed behaviors in that external comparison, with no equations, fitted parameters, self-defined quantities, or load-bearing self-citations that reduce the result to its own inputs by construction. The derivation chain is self-contained as a standard empirical study rather than a mathematical or definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that LLM agents can be prompted to produce correct metamorphic relations without systematic bias or hallucination; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption LLM agents can be orchestrated to produce valid Given-When-Then metamorphic scenarios for REST APIs
    Invoked in the description of the agentic workflow; no independent verification of agent output correctness is described in the abstract.

pith-pipeline@v0.9.1-grok · 5733 in / 1177 out tokens · 29693 ms · 2026-06-29T10:50:52.879252+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references · 6 canonical work pages

  1. [1]

    Uses and applications of the openapi/swagger specification: a systematic mapping of the literature,

    S. Casas, D. Cruz, G. Vidal, and M. Constanzo, “Uses and applications of the openapi/swagger specification: a systematic mapping of the literature,” in2021 40th International Conference of the Chilean Computer Science Society (SCCC). IEEE, 2021, pp. 1–8

  2. [2]

    Metamorphic testing: a new approach for generating next test cases,

    T. Y . Chenet al., “Metamorphic testing: a new approach for generating next test cases,”arXiv preprint arXiv:2002.12543, 2020

  3. [3]

    A new method for constructing metamorphic relations,

    H. Liuet al., “A new method for constructing metamorphic relations,” in12th International Conference on Quality Software. IEEE, 2012, pp. 59–68

  4. [4]

    Metamorphic testing for web services: Framework and a case study,

    C.-A. Sun, G. Wang, B. Mu, H. Liu, Z. Wang, and T. Y . Chen, “Metamorphic testing for web services: Framework and a case study,” in 2011 IEEE international Conference on Web Services. IEEE, 2011, pp. 283–290

  5. [5]

    Metamorphic testing of restful web apis,

    S. Segura, J. Parejo, J. Troya, and A. Ruiz-Cort ´es, “Metamorphic testing of restful web apis,”IEEE Transactions on Software Engineering, vol. PP, pp. 1–1, 10 2017

  6. [6]

    Metamorphic testing: A review of challenges and opportunities,

    T. Y . Chen, F. C. Kuo, H. Liu, P. L. Poon, D. Towey, T. H. Tse, and Z. Q. Zhou, “Metamorphic testing: A review of challenges and opportunities,” ACM Computing Surveys (CSUR), vol. 51, no. 1, pp. 1–27, 2018

  7. [7]

    A survey on metamorphic testing,

    S. Segura, G. Fraser, A. B. Sanchez, and A. Ruiz-Cort ´es, “A survey on metamorphic testing,”IEEE Transactions on Software Engineering, vol. 42, no. 9, pp. 805–824, 2016

  8. [8]

    Can chatgpt advance software testing intelligence? an experience report on metamorphic testing,

    Q. H. Luu, H. Liu, and T. Y . Chen, “Can chatgpt advance software testing intelligence? an experience report on metamorphic testing,”arXiv preprint arXiv:2310.19204, 2023

  9. [9]

    Automated metamorphic-relation generation with chatgpt,

    Y . Zhang, D. Towey, and M. Pike, “Automated metamorphic-relation generation with chatgpt,” inProceedings of the 47th IEEE Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2023, pp. 1–6

  10. [10]

    Towards generating executable metamorphic relations using large language models,

    S. Y . Shin, F. Pastore, D. Bianculli, and A. Baicoianu, “Towards generating executable metamorphic relations using large language models,” inInternational Conference on the Quality of Information and Communications Technology. Springer, 2024, pp. 126–141

  11. [11]

    Behave: Behavior-driven development (bdd) framework for python,

    B. Rice and R. Jones, “Behave: Behavior-driven development (bdd) framework for python,” https://pypi.org/project/behave/, Python Package Index (PyPI), 2025, version 1.3.3

  12. [12]

    Streamlit documentation,

    “Streamlit documentation,” https://docs.streamlit.io/, Accessed: December 2025

  13. [13]

    Testing restful apis: A survey,

    A. Golmohammadi, M. Zhang, and A. Arcuri, “Testing restful apis: A survey,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 1, pp. 1–41, 2023

  14. [14]

    Logiagent: Automated logical testing for rest systems with llm-based multi-agents,

    K. Zhanget al., “Logiagent: Automated logical testing for rest systems with llm-based multi-agents,”arXiv preprint arXiv:2503.15079, 2025

  15. [15]

    Restifai: Llm-based workflow for reusable rest api testing,

    L. Kogler, M. Ehrhart, B. Dornauer, and E. P. Enoiu, “Restifai: Llm-based workflow for reusable rest api testing,”arXiv preprint arXiv:2512.08706, 2025

  16. [16]

    Liang et al

    L. Liang, C. Tan, Y . Deng, Y . Cai, T. Y . Chen, and X. Zheng, “Automt: A multi-agent llm framework for automated metamorphic testing of autonomous driving systems,”arXiv preprint arXiv:2510.19438v1, 2025

  17. [17]

    Atil et al

    B. Atil, S. Aykent, A. Chittams, L. Fu, R. J. Passonneau, E. Radcliffe, G. R. Rajagopal, A. Sloan, T. Tudrej, F. Tureet al., “Non-determinism of ”deterministic” llm settings,”arXiv preprint arXiv:2408.04667, 2024

  18. [18]

    Wynne and A

    M. Wynne and A. Hellesoy,”The cucumber book: behaviour-driven development for testers and developers”. Pragmatic Bookshelf, 2012

  19. [19]

    [Online]

    Behave Development Team,Behave Documentation, Read the Docs, 2026, accessed: 2026-01-29. [Online]. Available: https: //behave.readthedocs.io/en/latest/

  20. [20]

    CrewAI documentation,

    CrewAI, “CrewAI documentation,” https://docs.crewai.com/, 2024, ac- cessed: 2025-02-15

  21. [21]

    Streamlit: A faster way to build and share data apps,

    Streamlit Inc., “Streamlit: A faster way to build and share data apps,” https://streamlit.io, 2023, accessed: 2026-02-06

  22. [22]

    Swagger petstore,

    Swagger API Team, “Swagger petstore,” https://github.com/swagger-api/ swagger-petstore, accessed: December 2025

  23. [23]

    Microservice-rbac-user-management: Role-Based Access Control User Management Microservice,

    Andrea Giassi, “Microservice-rbac-user-management: Role-Based Access Control User Management Microservice,” https://github.com/andreagiassi/ microservice-rbac-user-management, Accessed: December 2025

  24. [24]

    Automated test generation for rest apis: No time to rest yet,

    M. Kim, Q. Xin, S. Sinha, and A. Orso, “Automated test generation for rest apis: No time to rest yet,” inProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 289–301

  25. [25]

    Armeta - replication package,

    “Armeta - replication package,” https://gitlab.abo.fi/virtualseatrial-public/ ARMETA, accessed: 2026-05-26