Multi-Agent Specification-based Metamorphic Testing of FMU-Based Simulations
Pith reviewed 2026-06-29 23:50 UTC · model grok-4.3
The pith
A multi-agent LLM workflow derives Given-When-Then metamorphic relations from FMU specifications to generate tests for simulation models lacking explicit oracles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an LLM-powered multi-agent system can take functional and interface specifications as input and orchestrate agents to extract requirements and derive metamorphic relations in Given-When-Then patterns. These relations are then used to generate and execute metamorphic test cases on FMUs and to evaluate output consistency, thereby enabling systematic verification and validation of simulation models.
What carries the argument
The multi-agent LLM workflow that extracts requirements from specifications and derives metamorphic relations expressed using Given-When-Then patterns for structuring input conditions, transformations, and expected behaviors.
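To make the Given-When-Then structure concrete, here is a minimal Python sketch of how one metamorphic relation could be represented. The class name, field names, and the example relation (including the variables `oil_inlet_temp`, `coolant_flow`, and `oil_outlet_temp`) are assumptions made for illustration; the paper's actual representation is not shown in the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Inputs = Dict[str, float]
Outputs = Dict[str, float]


@dataclass(frozen=True)
class MetamorphicRelation:
    """One MR in Given-When-Then form (hypothetical structure)."""
    name: str
    given: Callable[[Inputs], bool]           # precondition on the source inputs
    when: Callable[[Inputs], Inputs]          # transformation producing follow-up inputs
    then: Callable[[Outputs, Outputs], bool]  # expected relation between the two outputs


# Illustrative MR; the variable names are invented for this sketch.
hotter_inlet = MetamorphicRelation(
    name="hotter_oil_inlet_does_not_cool_outlet",
    given=lambda x: x["coolant_flow"] > 0.0,
    when=lambda x: {**x, "oil_inlet_temp": x["oil_inlet_temp"] + 5.0},
    then=lambda src, fol: fol["oil_outlet_temp"] >= src["oil_outlet_temp"],
)

source = {"oil_inlet_temp": 70.0, "coolant_flow": 2.0}
follow_up = hotter_inlet.when(source)  # copy of the source inputs with a hotter inlet
```

Keeping `given`, `when`, and `then` as separate callables mirrors the three parts of the pattern: the precondition, the input transformation, and the expected output behaviour can each be reviewed on their own.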
If this is right
- Automatically generated MRs can be used to create metamorphic test cases for FMU simulations.
- The workflow evaluates consistency of outputs across multiple simulation sessions.
- It supports verification of dynamic simulation models exchanged via FMI.
- The preliminary evaluation indicates a reduction in manual effort for test generation.
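The execution and consistency steps listed above can be sketched as follows. This is a toy illustration under stated assumptions: `toy_cooler` is an invented stand-in for an FMU run (it is not the Lube Oil Cooling model), and `run_metamorphic_test` is a hypothetical helper, not the paper's implementation. In practice the `simulate` argument would wrap a real FMU simulation.

```python
from typing import Callable, Dict, List

Signal = Dict[str, float]


def toy_cooler(inputs: Signal) -> Signal:
    """Invented stand-in for an FMU run: steady-state outlet temperature."""
    effectiveness = inputs["coolant_flow"] / (inputs["coolant_flow"] + 1.0)
    outlet = inputs["oil_inlet_temp"] - effectiveness * (
        inputs["oil_inlet_temp"] - inputs["coolant_temp"]
    )
    return {"oil_outlet_temp": outlet}


def run_metamorphic_test(
    simulate: Callable[[Signal], Signal],
    source: Signal,
    transform: Callable[[Signal], Signal],
    relation: Callable[[Signal, Signal], bool],
    sessions: int = 3,
) -> List[bool]:
    """Run source and follow-up simulations once per session; collect verdicts."""
    verdicts = []
    for _ in range(sessions):
        source_out = simulate(source)
        follow_up_out = simulate(transform(source))
        verdicts.append(relation(source_out, follow_up_out))
    return verdicts


# Hypothetical MR: doubling the coolant flow must not raise the outlet temperature.
source = {"oil_inlet_temp": 80.0, "coolant_temp": 20.0, "coolant_flow": 1.0}
verdicts = run_metamorphic_test(
    toy_cooler,
    source,
    transform=lambda x: {**x, "coolant_flow": x["coolant_flow"] * 2.0},
    relation=lambda src, fol: fol["oil_outlet_temp"] <= src["oil_outlet_temp"],
)
consistent = len(set(verdicts)) == 1  # same verdict in every session
```

No expected output value is needed anywhere in the test: the verdict depends only on how the follow-up output relates to the source output, which is what lets metamorphic testing work without an explicit oracle.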
Where Pith is reading between the lines
- This method could extend to other types of simulation models beyond FMUs if the specification extraction works similarly.
- Integrating human review steps might improve accuracy of the derived relations.
- Applying the workflow to additional industrial FMU examples would test its generalizability.
Load-bearing premise
The multi-agent LLM system can accurately extract requirements from specifications and derive valid, meaningful metamorphic relations in Given-When-Then format without substantial human correction.
What would settle it
The claim would be falsified by a demonstration that many of the automatically derived metamorphic relations for the Lube Oil Cooling system FMU are invalid or do not hold when checked against actual simulation behavior.
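A check of this kind could look roughly like the sketch below: each derived relation is run against the model for several source inputs, and the share of relations that fail is reported. Everything here is hypothetical, including `toy_model`, the two relations, and the variable names; the second relation is deliberately wrong so that the check has something to catch.

```python
from typing import Callable, Dict, List, Tuple

Signal = Dict[str, float]
Relation = Tuple[str, Callable[[Signal], Signal], Callable[[float, float], bool]]


def toy_model(x: Signal) -> float:
    """Invented stand-in for an FMU run; returns an oil outlet temperature."""
    return 0.5 * (x["oil_inlet_temp"] + x["coolant_temp"])


def violated_relations(
    simulate: Callable[[Signal], float],
    relations: List[Relation],
    sources: List[Signal],
) -> List[str]:
    """Names of relations that fail for at least one source input."""
    violated = []
    for name, transform, check in relations:
        holds = all(check(simulate(s), simulate(transform(s))) for s in sources)
        if not holds:
            violated.append(name)
    return violated


relations: List[Relation] = [
    ("hotter_inlet_not_cooler_outlet",
     lambda x: {**x, "oil_inlet_temp": x["oil_inlet_temp"] + 10.0},
     lambda src, fol: fol >= src),
    # Deliberately invalid relation, as an LLM might mistakenly derive.
    ("warmer_coolant_lowers_outlet",
     lambda x: {**x, "coolant_temp": x["coolant_temp"] + 5.0},
     lambda src, fol: fol < src),
]
sources = [
    {"oil_inlet_temp": 80.0, "coolant_temp": 20.0},
    {"oil_inlet_temp": 60.0, "coolant_temp": 30.0},
]

violated = violated_relations(toy_model, relations, sources)
violation_rate = len(violated) / len(relations)
```

A violation on a trusted model points to an invalid relation, whereas a violation on a model under test may instead point to a fault in the model, so the interpretation depends on which of the two is assumed correct.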
Original abstract
In many industrial domains, the Functional Mock-up Interface (FMI) is used to exchange simulation models as Functional Mock-up Units (FMUs) across different partners using various modelling tools. This opens up the possibilities for simulation-based verification and validation using FMUs for ensuring reliable system behaviour. However, deriving effective test oracles for these simulation models remains challenging due to the absence of explicit expected outputs. This limits the applicability of conventional testing approaches, which require access to the internal workings of the systems. Metamorphic testing (MT) addresses this limitation by leveraging metamorphic relations (MRs), but extracting such relations from specifications remains largely a manual and error-prone process. To address this challenge, we propose an LLM-powered multi-agent workflow for specification-based metamorphic testing of FMU-based simulation models. The approach takes functional and interface specifications as input and orchestrates multiple agents to extract requirements and derive MRs. These MRs are expressed using Given-When-Then patterns to structure input conditions (Given), transformations (When), and expected output behaviours (Then). These relations are then used to generate metamorphic test cases, execute simulations, and evaluate output consistency across multiple sessions. We evaluate the approach on a Lube Oil Cooling system FMU, demonstrating its ability to automatically generate meaningful MRs and corresponding test cases. Preliminary results indicate that the proposed workflow can effectively support the systematic verification and validation of dynamic simulation models by reducing manual effort and improving test generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an LLM-powered multi-agent workflow that takes functional and interface specifications as input, extracts requirements via agents, derives metamorphic relations (MRs) expressed in Given-When-Then format, generates and executes metamorphic test cases on FMU simulations, and checks output consistency. It is demonstrated on a single Lube Oil Cooling system FMU, with the claim that preliminary results show the workflow automatically generates meaningful MRs, reduces manual effort, and supports systematic V&V of dynamic simulation models.
Significance. If the central claim holds under rigorous evaluation, the work would automate a key manual bottleneck in metamorphic testing for black-box FMI/FMU models, which are widely used in industrial co-simulation. The multi-agent orchestration for requirement extraction and GWT-structured MR derivation represents a practical engineering contribution that could be extended to other specification-driven testing domains.
major comments (2)
- [Abstract] The claim that the workflow 'automatically generate[s] meaningful MRs' and 'reduc[es] manual effort' rests on unmeasured extraction accuracy; no error rate, fraction of MRs needing revision, inter-rater agreement with human experts, or comparison to manually authored relations is reported, so the effectiveness for systematic V&V cannot be assessed from the single-example preliminary results.
- [Evaluation] The Lube Oil Cooling FMU demonstration supplies no quantitative metrics (e.g., fault-detection rate of generated MRs, baseline against manual MRs, or number of test cases executed), nor any validation method for MR soundness against the specification, leaving the improvement-in-test-generation claim unsupported.
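One of the metrics named above, the fault-detection rate, is commonly estimated by running the relations against deliberately faulty model variants (mutants). The sketch below illustrates that idea on invented toy models; the reference model, both mutants, and both relations are assumptions for illustration and are not taken from the paper.

```python
from typing import Callable, List, Tuple

Inputs = Tuple[float, float]  # (oil inlet temperature, coolant flow)
Model = Callable[[Inputs], float]
Relation = Tuple[Callable[[Inputs], Inputs], Callable[[float, float], bool]]


def reference(x: Inputs) -> float:
    """Invented toy cooler: more coolant flow lowers the outlet temperature."""
    inlet, flow = x
    return inlet - flow / (flow + 1.0) * (inlet - 20.0)


def mutant_sign(x: Inputs) -> float:
    """Seeded fault: the cooling term has the wrong sign."""
    inlet, flow = x
    return inlet + flow / (flow + 1.0) * (inlet - 20.0)


def mutant_ignores_flow(x: Inputs) -> float:
    """Seeded fault: the coolant flow has no effect."""
    inlet, _flow = x
    return inlet - 0.5 * (inlet - 20.0)


RELATIONS: List[Relation] = [
    # Doubling the coolant flow must not raise the outlet temperature.
    (lambda x: (x[0], x[1] * 2.0), lambda src, fol: fol <= src),
    # Raising the inlet temperature must not lower the outlet temperature.
    (lambda x: (x[0] + 10.0, x[1]), lambda src, fol: fol >= src),
]


def detects(model: Model, source: Inputs) -> bool:
    """True if at least one relation is violated by the model."""
    return any(
        not check(model(source), model(transform(source)))
        for transform, check in RELATIONS
    )


source = (80.0, 1.0)
mutants = [mutant_sign, mutant_ignores_flow]
killed = sum(detects(m, source) for m in mutants)
detection_rate = killed / len(mutants)
```

In this toy setup the second mutant survives because neither relation requires the output to change strictly, which is the kind of gap a reported detection rate would expose.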
minor comments (1)
- [Abstract] The phrase 'meaningful MRs' is used without an operational definition or criteria (e.g., executability, specificity, or fault-revealing power) that would allow readers to interpret the preliminary results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger empirical support. We address each major comment below and indicate planned revisions to the manuscript.
- Referee: [Abstract] The claim that the workflow 'automatically generate[s] meaningful MRs' and 'reduc[es] manual effort' rests on unmeasured extraction accuracy; no error rate, fraction of MRs needing revision, inter-rater agreement with human experts, or comparison to manually authored relations is reported, so the effectiveness for systematic V&V cannot be assessed from the single-example preliminary results.
  Authors: We agree that the abstract claims regarding automatic generation of meaningful MRs and reduction of manual effort are not supported by quantitative measures such as extraction accuracy, error rates, or comparisons to human-authored relations. The current work presents a preliminary demonstration on a single FMU. We will revise the abstract to qualify these claims, emphasizing the exploratory nature of the results and removing unsupported assertions about effectiveness for systematic V&V.
  Revision: yes
- Referee: [Evaluation] The Lube Oil Cooling FMU demonstration supplies no quantitative metrics (e.g., fault-detection rate of generated MRs, baseline against manual MRs, or number of test cases executed), nor any validation method for MR soundness against the specification, leaving the improvement-in-test-generation claim unsupported.
  Authors: The evaluation section is limited to a qualitative demonstration on one FMU without quantitative metrics such as fault-detection rates, baselines against manual MRs, or explicit validation of MR soundness. This accurately reflects the preliminary scope of the paper, which prioritizes workflow description over a controlled empirical study. We will revise the evaluation section to report concrete details including the number of MRs generated, test cases executed, and the manual process used to check consistency with the specification. We will also add an explicit limitations subsection and outline plans for future quantitative evaluation.
  Revision: partial
Circularity Check
No circularity: applied workflow paper with no derivations or self-referential predictions
The paper presents an engineering workflow for LLM-based extraction of metamorphic relations from specifications, followed by test generation and execution on an FMU example. No equations, fitted parameters, uniqueness theorems, or predictions appear in the provided text. The central claim is an empirical demonstration on a single Lube Oil Cooling FMU that the workflow 'automatically generate[s] meaningful MRs'; this is a direct report of observed output rather than a derivation that reduces to its own inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing steps. The paper is therefore self-contained as a descriptive applied-methods contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can reliably interpret functional and interface specifications and generate correct Given-When-Then metamorphic relations.