Multi-Agent Specification-based Metamorphic Testing of FMU-Based Simulations

Abdullah Mughees; Ashir Kulshreshtha; Dragos Truscan; Gaadha Sudheerbabu; Kristian Klemets; Mikael Manngård; Tanwir Ahmad

arXiv: 2605.25101 · v1 · pith:T3OCOWKLnew · submitted 2026-05-24 · 💻 cs.SE · cs.AI · cs.SY · eess.SY

Pith reviewed 2026-06-29 23:50 UTC · model grok-4.3

classification: 💻 cs.SE, cs.AI, cs.SY, eess.SY

keywords: metamorphic testing, FMU, FMI, multi-agent systems, LLM, simulation testing, test oracles, Given-When-Then

The pith

A multi-agent LLM workflow derives Given-When-Then metamorphic relations from FMU specifications to generate tests for simulation models lacking explicit oracles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an automated workflow in which multiple LLM agents process the functional and interface specifications of FMU-based simulations to extract requirements and create metamorphic relations. These relations are written in Given-When-Then style to define input conditions, transformations, and expected behaviors, so test cases can be generated and executed without traditional oracles. An evaluation on a Lube Oil Cooling system FMU is presented as evidence that the system can produce meaningful relations and tests, and that it reduces the manual effort of handling the oracle problem for dynamic models. This matters because FMI allows model exchange across tools, but testing remains hard when expected outputs are unavailable. The approach targets the slow, error-prone manual extraction of metamorphic relations.

Core claim

The central claim is that an LLM-powered multi-agent system can take functional and interface specifications as input, orchestrate agents to extract requirements and derive metamorphic relations in Given-When-Then patterns, use these to generate and execute metamorphic test cases on FMUs, and evaluate output consistency, thereby enabling systematic verification and validation of simulation models.

What carries the argument

The multi-agent LLM workflow that extracts requirements from specifications and derives metamorphic relations expressed using Given-When-Then patterns for structuring input conditions, transformations, and expected behaviors.
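One way to read the Given-When-Then structure is as three callables: a precondition on the source input, a transformation that yields the follow-up input, and a relation between the two outputs. The sketch below assumes that reading. The `MetamorphicRelation` class, the `toy_cooler` function, and all variable names are illustrative stand-ins, not taken from the paper or from the Lube Oil Cooling FMU.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Inputs = Dict[str, float]
Outputs = Dict[str, float]


@dataclass(frozen=True)
class MetamorphicRelation:
    name: str
    given: Callable[[Inputs], bool]            # precondition on the source input
    when: Callable[[Inputs], Inputs]           # transformation to the follow-up input
    then: Callable[[Outputs, Outputs], bool]   # relation between the two outputs

    def check(self, model: Callable[[Inputs], Outputs], source: Inputs) -> bool:
        if not self.given(source):
            raise ValueError(f"{self.name}: source input violates the Given clause")
        follow_up = self.when(source)
        return self.then(model(source), model(follow_up))


def toy_cooler(inputs: Inputs) -> Outputs:
    """Toy steady-state cooler (an assumption standing in for the FMU)."""
    effectiveness = inputs["coolant_flow"] / (inputs["coolant_flow"] + 1.0)
    delta = inputs["oil_in_temp"] - inputs["coolant_temp"]
    return {"oil_out_temp": inputs["oil_in_temp"] - effectiveness * delta}


mr = MetamorphicRelation(
    name="more coolant does not heat the oil",
    given=lambda x: x["coolant_flow"] > 0 and x["oil_in_temp"] > x["coolant_temp"],
    when=lambda x: {**x, "coolant_flow": x["coolant_flow"] * 1.5},
    then=lambda src, fol: fol["oil_out_temp"] <= src["oil_out_temp"],
)

source = {"coolant_flow": 2.0, "oil_in_temp": 80.0, "coolant_temp": 30.0}
print(mr.check(toy_cooler, source))
```

No expected output value appears anywhere in the relation: the verdict depends only on how the two runs compare, which is what lets metamorphic testing work without an oracle.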

If this is right

Automatically generated MRs can be used to create metamorphic test cases for FMU simulations.
The workflow evaluates consistency of outputs across multiple simulation sessions.
It supports verification of dynamic simulation models exchanged via FMI.
Preliminary evaluation indicates reduction in manual effort for test generation.
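The consistency check across sessions could look roughly like the sketch below. This page does not define what a "session" is, so the code assumes it means a repeated source and follow-up run pair, here distinguished by a random seed. The `noisy_cooler` model, the tolerance, and the seed offset are hypothetical choices made for illustration.

```python
import random
from typing import Callable, Dict, List, Sequence

Model = Callable[[float, int], float]


def noisy_cooler(coolant_flow: float, seed: int) -> float:
    """Toy model with small seeded noise; stands in for one simulation session."""
    rng = random.Random(seed)
    effectiveness = coolant_flow / (coolant_flow + 1.0)
    return 80.0 - effectiveness * 50.0 + rng.uniform(-0.05, 0.05)


def session_verdicts(
    model: Model,
    source_flow: float,
    factor: float,
    seeds: Sequence[int],
    tolerance: float,
) -> List[bool]:
    """Evaluate 'more coolant flow does not raise outlet temperature' per session."""
    verdicts = []
    for seed in seeds:
        source_out = model(source_flow, seed)
        follow_out = model(source_flow * factor, seed + 1000)
        verdicts.append(follow_out <= source_out + tolerance)
    return verdicts


def summarize(verdicts: Sequence[bool]) -> Dict[str, object]:
    """A relation is 'consistent' if every session agrees on the verdict."""
    passed = sum(verdicts)
    return {
        "sessions": len(verdicts),
        "passed": passed,
        "consistent": passed in (0, len(verdicts)),
    }


verdicts = session_verdicts(noisy_cooler, 2.0, 1.5, seeds=range(5), tolerance=0.1)
print(summarize(verdicts))
```

A relation that passes in some sessions and fails in others is worth flagging separately from one that fails everywhere, since the former points at noise or tolerance choices rather than at the model.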

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This method could extend to other types of simulation models beyond FMUs if the specification extraction works similarly.
Integrating human review steps might improve accuracy of the derived relations.
Applying the workflow to additional industrial FMU examples would test its generalizability.

Load-bearing premise

The multi-agent LLM system can accurately extract requirements from specifications and derive valid, meaningful metamorphic relations in Given-When-Then format without substantial human correction.

What would settle it

A demonstration that many of the automatically derived metamorphic relations for the Lube Oil Cooling system FMU are invalid or do not hold when checked against actual simulation behavior would falsify the claim.
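The falsification check described here amounts to running each derived relation against the simulation and counting the ones that fail. The sketch below assumes a toy explicit-Euler model in place of the real FMU and a hand-written list of candidate relations, one of them deliberately wrong. None of the names, constants, or relations come from the paper.

```python
from typing import Callable, Dict, List, Tuple

Trace = List[float]


def simulate_oil_temp(coolant_flow: float, steps: int = 200, dt: float = 0.1) -> Trace:
    """Explicit-Euler toy model: oil temperature relaxes toward a flow-dependent target."""
    temp = 80.0
    target = 80.0 - 50.0 * coolant_flow / (coolant_flow + 1.0)
    trace = []
    for _ in range(steps):
        temp += dt * (target - temp) / 2.0  # assumed time constant of 2 s
        trace.append(temp)
    return trace


# (name, When: transform the source input, Then: compare source and follow-up traces)
Candidate = Tuple[str, Callable[[float], float], Callable[[Trace, Trace], bool]]

candidates: List[Candidate] = [
    ("final temperature not higher",
     lambda flow: flow * 2.0,
     lambda src, fol: fol[-1] <= src[-1]),
    ("never hotter at any step",
     lambda flow: flow * 2.0,
     lambda src, fol: all(b <= a for a, b in zip(src, fol))),
    ("final temperature unchanged (deliberately wrong)",
     lambda flow: flow * 2.0,
     lambda src, fol: abs(fol[-1] - src[-1]) < 1e-6),
]


def evaluate(relations: List[Candidate], source_flow: float) -> Dict[str, bool]:
    """Return, per relation, whether it holds on the simulated behaviour."""
    source = simulate_oil_temp(source_flow)
    return {
        name: then(source, simulate_oil_temp(when(source_flow)))
        for name, when, then in relations
    }


report = evaluate(candidates, source_flow=1.0)
violation_rate = sum(not ok for ok in report.values()) / len(report)
for name, ok in report.items():
    print(f"{'holds ' if ok else 'FAILS '} {name}")
print(f"violation rate: {violation_rate:.2f}")
```

A violation on its own does not say which side is wrong. On a trusted model it indicts the relation; on an untrusted model it may be the fault the test was meant to find, so a check like this needs a reference whose behaviour is already accepted.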

Figures

Figures reproduced from arXiv: 2605.25101 by Abdullah Mughees, Ashir Kulshreshtha, Dragos Truscan, Gaadha Sudheerbabu, Kristian Klemets, Mikael Manngård, Tanwir Ahmad.

**Figure 1.** Overview of the Multi-Agent workflow. The approach takes as input the simulation model, packed as an FMI-standard compliant FMU, which comprises the functional specification, model description and execution binaries to facilitate the metamorphic testing. In the first step of our approach, we extract system properties to be selected for MR selection from the functional and interface specifications of the sim…

**Figure 2.** Workflow of Extractor Agent. The Extractor Agent operates by transforming the input functional specification document through a structured conversion pipeline, then formulating an output in an LLM-friendly content format.

**Figure 3.** MR generated by the agent in Given-When-Then format.

Original abstract

In many industrial domains, the Functional Mock-up Interface (FMI) is used to exchange simulation models as Functional Mock-up Units (FMUs) across different partners using various modelling tools. This opens up the possibilities for simulation-based verification and validation using FMUs for ensuring reliable system behaviour. However, deriving effective test oracles for these simulation models remains challenging due to the absence of explicit expected outputs. This limits the applicability of conventional testing approaches, which require access to the internal workings of the systems. Metamorphic testing (MT) addresses this limitation by leveraging metamorphic relations (MRs), but extracting such relations from specifications remains largely a manual and error-prone process. To address this challenge, we propose an LLM-powered multi-agent workflow for specification-based metamorphic testing of FMU-based simulation models. The approach takes functional and interface specifications as input and orchestrates multiple agents to extract requirements and derive MRs. These MRs are expressed using Given-When-Then patterns to structure input conditions (Given), transformations (When), and expected output behaviours (Then). These relations are then used to generate metamorphic test cases, execute simulations, and evaluate output consistency across multiple sessions. We evaluate the approach on a Lube Oil Cooling system FMU, demonstrating its ability to automatically generate meaningful MRs and corresponding test cases. Preliminary results indicate that the proposed workflow can effectively support the systematic verification and validation of dynamic simulation models by reducing manual effort and improving test generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The multi-agent LLM workflow for turning specs into Given-When-Then MRs for FMU testing is a practical engineering proposal, but the single-example evaluation supplies no data on relation quality or effort savings.

The paper's main contribution is a workflow that chains several LLM agents to pull requirements from functional and interface specs, then produce metamorphic relations in Given-When-Then format for black-box FMU simulations. They run the generated tests across multiple sessions and check output consistency.

The setup targets a genuine industrial problem: FMI lets partners exchange models without source access, so conventional oracles are unavailable and manual MR extraction is slow. Framing the relations in GWT style and tying them to simulation execution is a clean way to make the output usable.

The evaluation stays at the level of a single Lube Oil Cooling FMU. The abstract calls the relations "meaningful" and says the workflow reduces manual effort, yet it gives no counts on extraction accuracy, fraction of relations that needed human fixes, inter-rater agreement with experts, or comparison against hand-written MRs. Without those numbers the central claim stays untested.

The stress-test note is on target here. If a non-trivial share of the generated relations turn out to be either unsound or too weak to expose faults, the promised reduction in manual work does not appear. The paper does not report any such measurement.
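Whether a relation is "too weak to expose faults" can be probed by seeding faults into a model and counting how many the relation catches. The sketch below is one such probe under stated assumptions: the `reference` model and the three mutants are invented for illustration, and the two variants of the relation (strict and non-strict) are hypothetical examples, not relations reported in the paper.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[float], float]


def reference(flow: float) -> float:
    """Toy steady-state outlet temperature; treated as the correct model."""
    return 80.0 - 50.0 * flow / (flow + 1.0)


# Seeded faults (hypothetical): each mutant breaks the reference in one way.
mutants: Dict[str, Model] = {
    "sign_flipped": lambda flow: 80.0 + 50.0 * flow / (flow + 1.0),
    "flow_ignored": lambda flow: 55.0,
    "gain_halved": lambda flow: 80.0 - 25.0 * flow / (flow + 1.0),
}


def mr_holds(model: Model, source_flow: float, strict: bool) -> bool:
    """Relation: doubling coolant flow must not raise (or must lower) the temperature."""
    src = model(source_flow)
    fol = model(source_flow * 2.0)
    return fol < src if strict else fol <= src


def mutation_score(strict: bool, source_flow: float = 1.0) -> Tuple[float, List[str]]:
    """Fraction of mutants on which the relation is violated, plus their names."""
    if not mr_holds(reference, source_flow, strict):
        raise ValueError("relation is unsound: it fails on the reference model")
    killed = [
        name for name, mutant in mutants.items()
        if not mr_holds(mutant, source_flow, strict)
    ]
    return len(killed) / len(mutants), killed


for strict in (False, True):
    score, killed = mutation_score(strict)
    print(f"strict={strict}: score={score:.2f}, killed={killed}")
```

In this toy setting the non-strict relation misses the mutant that ignores coolant flow, while the strict one catches it, and neither notices the halved gain. Both relations are sound, yet they differ in strength, which is the distinction the letter asks the paper to measure.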

This is aimed at engineers who already work with FMI co-simulation and want to explore LLM assistance for test-oracle generation. A reader looking for a worked example of multi-agent prompting in a narrow domain might pick up the workflow description.

It is worth sending to peer review. The problem is clearly stated and the method is described at a level that lets others replicate or extend it, but any referee will need to see quantitative checks on the quality of the LLM output before the effectiveness claims can be taken seriously.

Referee Report

2 major / 1 minor

Summary. The paper proposes an LLM-powered multi-agent workflow that takes functional and interface specifications as input, extracts requirements via agents, derives metamorphic relations (MRs) expressed in Given-When-Then format, generates and executes metamorphic test cases on FMU simulations, and checks output consistency. It is demonstrated on a single Lube Oil Cooling system FMU, with the claim that preliminary results show the workflow automatically generates meaningful MRs, reduces manual effort, and supports systematic V&V of dynamic simulation models.

Significance. If the central claim holds under rigorous evaluation, the work would automate a key manual bottleneck in metamorphic testing for black-box FMI/FMU models, which are widely used in industrial co-simulation. The multi-agent orchestration for requirement extraction and GWT-structured MR derivation represents a practical engineering contribution that could be extended to other specification-driven testing domains.

major comments (2)

[Abstract] Abstract: The claim that the workflow 'automatically generate[s] meaningful MRs' and 'reduc[es] manual effort' rests on unmeasured extraction accuracy; no error rate, fraction of MRs needing revision, inter-rater agreement with human experts, or comparison to manually authored relations is reported, so the effectiveness for systematic V&V cannot be assessed from the single-example preliminary results.
[Evaluation] Evaluation section: The Lube Oil Cooling FMU demonstration supplies no quantitative metrics (e.g., fault-detection rate of generated MRs, baseline against manual MRs, or number of test cases executed), nor any validation method for MR soundness against the specification, leaving the improvement-in-test-generation claim unsupported.
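Two of the measures requested in these comments are cheap to compute once experts have labelled the generated relations. The sketch below shows the fraction of relations needing revision and Cohen's kappa for agreement between two raters. The label vocabulary and the two label lists are hypothetical placeholders for illustration only; they are not results from the paper.

```python
from collections import Counter
from typing import Sequence


def revision_fraction(labels: Sequence[str]) -> float:
    """Share of relations not accepted as-is (any label other than 'valid')."""
    if not labels:
        raise ValueError("no labels given")
    return sum(label != "valid" for label in labels) / len(labels)


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six generated relations (illustrative only).
expert_1 = ["valid", "valid", "revise", "invalid", "valid", "revise"]
expert_2 = ["valid", "revise", "revise", "invalid", "valid", "valid"]

print(f"needs revision (expert 1): {revision_fraction(expert_1):.2f}")
print(f"Cohen's kappa: {cohens_kappa(expert_1, expert_2):.3f}")
```

Kappa is used here rather than raw agreement because two raters who both label most relations "valid" would agree often by chance alone.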

minor comments (1)

[Abstract] Abstract: The phrase 'meaningful MRs' is used without an operational definition or criteria (e.g., executability, specificity, or fault-revealing power) that would allow readers to interpret the preliminary results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical support. We address each major comment below and indicate planned revisions to the manuscript.

Referee: [Abstract] Abstract: The claim that the workflow 'automatically generate[s] meaningful MRs' and 'reduc[es] manual effort' rests on unmeasured extraction accuracy; no error rate, fraction of MRs needing revision, inter-rater agreement with human experts, or comparison to manually authored relations is reported, so the effectiveness for systematic V&V cannot be assessed from the single-example preliminary results.

Authors: We agree that the abstract claims regarding automatic generation of meaningful MRs and reduction of manual effort are not supported by quantitative measures such as extraction accuracy, error rates, or comparisons to human-authored relations. The current work presents a preliminary demonstration on a single FMU. We will revise the abstract to qualify these claims, emphasizing the exploratory nature of the results and removing unsupported assertions about effectiveness for systematic V&V. revision: yes

Referee: [Evaluation] Evaluation section: The Lube Oil Cooling FMU demonstration supplies no quantitative metrics (e.g., fault-detection rate of generated MRs, baseline against manual MRs, or number of test cases executed), nor any validation method for MR soundness against the specification, leaving the improvement-in-test-generation claim unsupported.

Authors: The evaluation section is limited to a qualitative demonstration on one FMU without quantitative metrics such as fault-detection rates, baselines against manual MRs, or explicit validation of MR soundness. This accurately reflects the preliminary scope of the paper, which prioritizes workflow description over a controlled empirical study. We will revise the evaluation section to report concrete details including the number of MRs generated, test cases executed, and the manual process used to check consistency with the specification. We will also add an explicit limitations subsection and outline plans for future quantitative evaluation. revision: partial

Circularity Check

0 steps flagged

No circularity: applied workflow paper with no derivations or self-referential predictions

The paper presents an engineering workflow for LLM-based extraction of metamorphic relations from specifications, followed by test generation and execution on an FMU example. No equations, fitted parameters, uniqueness theorems, or predictions appear in the provided text. The central claim is an empirical demonstration on a single Lube Oil Cooling FMU that the workflow 'automatically generate[s] meaningful MRs'; this is a direct report of observed output rather than a derivation that reduces to its own inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing steps. The paper is therefore self-contained as a descriptive applied-methods contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an applied method proposal in software engineering; the central claim rests on the unverified capability of LLMs to perform accurate extraction and relation derivation rather than on any mathematical constructs.

axioms (1)

domain assumption: Large language models can reliably interpret functional and interface specifications and generate correct Given-When-Then metamorphic relations.
The entire multi-agent workflow depends on this capability for the agents to produce usable MRs and test cases.

Reference graph

Works this paper leans on

20 extracted references · 4 canonical work pages

[1] J. Cederbladh et al. Early validation and verification of system behaviour in model-based systems engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 33(3), 2024.

[2] T. Blochwitz et al. Functional mockup interface 2.0: The standard for tool independent exchange of simulation models. In 9th international modelica conference, pp. 173–184. The Modelica Association, 2012.

[3] T. Y. Chen et al. Metamorphic testing: a new approach for generating next test cases. arXiv preprint arXiv:2002.12543, 2020.

[4] H. Liu et al. A new method for constructing metamorphic relations. In 12th International Conference on Quality Software. IEEE, 2012.

[5] S. Segura and Z. Q. Zhou. Metamorphic testing 20 years later: A hands-on introduction. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings, pp. 538–539, 2018.

[6] A. Núñez and R. M. Hierons. A methodology for validating cloud models using metamorphic testing. annals of telecommunications-annales des télécommunications, 70(3):127–135, 2015.

[7] M. Lindvall et al. Metamorphic model-based testing of autonomous systems. In 2017 IEEE/ACM 2nd International Workshop on Metamorphic Testing (MET), pp. 35–41. IEEE, 2017.

[8] M. Olsen and M. Raunak. Increasing validity of simulation models through metamorphic testing. IEEE Trans. on Reliability, 68(1), 2018.

[9] G. Sudheerbabu et al. Validation of dynamic simulation models using metamorphic testing and given-when-then patterns. In Modelica Conferences, pp. 139–146, 2025.

[10] S. Segura et al. A survey on metamorphic testing. IEEE Transactions on Software Engineering, 42(9):805–824, 2016.

[11] T. Y. Chen et al. Metamorphic testing: A review of challenges and opportunities. ACM Computing Surveys (CSUR), 51(1):1–27, 2018.

[12] Q. H. Luu et al. Can chatgpt advance software testing intelligence? an experience report on metamorphic testing. arXiv:2310.19204, 2023.

[13] Y. Zhang et al. Automated metamorphic-relation generation with chatgpt. In Proceedings of the 47th IEEE Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1–6. IEEE, 2023.

[14] S. Y. Shin et al. Towards generating executable metamorphic relations using large language models. In Intl. Conf. on the Quality of Information and Communications Technology, pp. 126–141. Springer, 2024.

[15] NoviaRDISeafaring. Virtual Sea Trial Project. https://github.com/Novia-RDI-Seafaring/fmu-opc-hackathon/tree/main/fmus/loc, 2024.

[16] L. Liang et al. AutoMT: A Multi-Agent LLM Framework for Automated Metamorphic Testing of Autonomous Driving Systems. arXiv preprint arXiv:2510.19438v1, 2025.

[17] B. Atil et al. Non-determinism of "deterministic" llm settings. arXiv preprint arXiv:2408.04667, 2024.

[18] M. Wynne and A. Hellesoy. The cucumber book: behaviour-driven development for testers and developers. Pragmatic Bookshelf, 2012.

[19] LangChain. LangGraph. https://reference.langchain.com/python/langgraph/overview, 2024. Accessed: 2026-04-30.

[20] Dassault Systèmes. FMPy. https://fmpy.readthedocs.io/en/latest/, 2017. Accessed: 2026-04-21.
