
arXiv: 2605.25101 · v1 · pith:T3OCOWKLnew · submitted 2026-05-24 · 💻 cs.SE · cs.AI · cs.SY · eess.SY

Multi-Agent Specification-based Metamorphic Testing of FMU-Based Simulations

Pith reviewed 2026-06-29 23:50 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.SY · eess.SY
keywords metamorphic testing · FMU · FMI · multi-agent systems · LLM · simulation testing · test oracles · Given-When-Then

The pith

A multi-agent LLM workflow derives Given-When-Then metamorphic relations from FMU specifications to generate tests for simulation models lacking explicit oracles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an automated workflow in which multiple LLM agents process the functional and interface specifications of FMU-based simulations to extract requirements and derive metamorphic relations. The relations are written in Given-When-Then style, which separates input conditions, transformations, and expected behaviors, so test cases can be generated and executed without a traditional oracle. In an evaluation on a Lube Oil Cooling system FMU, the paper reports that the workflow produces meaningful relations and tests, and presents this as a way to reduce the manual effort of handling the oracle problem for dynamic models. This matters because FMI lets partners exchange models across tools, yet testing those models is hard when no expected outputs are available. The workflow targets the step that is still manual and error-prone: extracting metamorphic relations from specifications.

Core claim

The central claim is that an LLM-powered multi-agent system can take functional and interface specifications as input, orchestrate agents to extract requirements and derive metamorphic relations in Given-When-Then patterns, use these to generate and execute metamorphic test cases on FMUs, and evaluate output consistency, thereby enabling systematic verification and validation of simulation models.

What carries the argument

The multi-agent LLM workflow that extracts requirements from specifications and derives metamorphic relations expressed using Given-When-Then patterns for structuring input conditions, transformations, and expected behaviors.
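
To make the Given-When-Then structure concrete, here is a minimal Python sketch of one metamorphic relation and its check. Everything in it is an assumption for illustration: `GwtRelation`, `toy_cooler`, the input names, and the relation itself ("doubling coolant flow does not raise the oil outlet temperature") are hypothetical and are not taken from the paper or from the Lube Oil Cooling FMU.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Inputs = Dict[str, float]
Outputs = Dict[str, float]


@dataclass(frozen=True)
class GwtRelation:
    """One metamorphic relation split into its three clauses (hypothetical type)."""
    name: str
    given: Callable[[Inputs], bool]            # Given: condition on the source input
    when: Callable[[Inputs], Inputs]           # When: transformation to the follow-up input
    then: Callable[[Outputs, Outputs], bool]   # Then: expected relation between outputs


def toy_cooler(inputs: Inputs) -> Outputs:
    """Hypothetical steady-state stand-in for one simulation run (not the paper's FMU)."""
    t_in = inputs["oil_in_temp"]
    t_cool = inputs["coolant_temp"]
    flow = inputs["coolant_flow"]
    effectiveness = flow / (flow + 1.0)  # grows with flow, always below 1
    return {"oil_out_temp": t_in - effectiveness * (t_in - t_cool)}


def check(relation: GwtRelation,
          simulate: Callable[[Inputs], Outputs],
          source: Inputs) -> bool:
    """Run the source and follow-up inputs, then compare outputs with the Then clause."""
    if not relation.given(source):
        raise ValueError("source input violates the Given clause")
    follow_up = relation.when(source)
    return relation.then(simulate(source), simulate(follow_up))


# Illustrative relation, assumed for this sketch only.
more_flow_cools = GwtRelation(
    name="doubling coolant flow does not raise oil outlet temperature",
    given=lambda x: x["oil_in_temp"] > x["coolant_temp"] and x["coolant_flow"] > 0,
    when=lambda x: {**x, "coolant_flow": 2.0 * x["coolant_flow"]},
    then=lambda src, fol: fol["oil_out_temp"] <= src["oil_out_temp"],
)

source = {"oil_in_temp": 80.0, "coolant_temp": 30.0, "coolant_flow": 1.0}
holds = check(more_flow_cools, toy_cooler, source)
```

No expected output value appears anywhere: the verdict comes only from comparing the source run with the follow-up run, which is how a metamorphic relation sidesteps the missing oracle.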

If this is right

  • Automatically generated MRs can be used to create metamorphic test cases for FMU simulations.
  • The workflow evaluates consistency of outputs across multiple simulation sessions.
  • It supports verification of dynamic simulation models exchanged via FMI.
  • Preliminary evaluation indicates reduction in manual effort for test generation.
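
The summary does not say how output consistency across sessions is judged. One possible reading, sketched below under that assumption, is that the same test is run in several sessions and the recorded output trajectories are compared against the first session within a tolerance. The function names, the tolerance, and the sample values are all hypothetical.

```python
from typing import Sequence


def max_deviation(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest pointwise absolute difference between two equally sampled trajectories."""
    if len(a) != len(b):
        raise ValueError("trajectories must have the same number of samples")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def consistent(sessions: Sequence[Sequence[float]], tol: float = 1e-6) -> bool:
    """True if every session stays within `tol` of the first session's trajectory."""
    if not sessions:
        raise ValueError("need at least one session")
    reference = sessions[0]
    return all(max_deviation(reference, s) <= tol for s in sessions[1:])


# Made-up trajectories of one output variable from three sessions.
session_a = [55.0, 50.0, 46.7]
session_b = [55.0, 50.0, 46.7000001]   # differs only by numerical noise
session_c = [55.0, 50.5, 46.7]         # differs by half a degree at one sample

stable = consistent([session_a, session_b])
unstable = consistent([session_a, session_b, session_c])
```

If "sessions" instead refers to repeated LLM runs, the same comparison could be applied to the verdicts or relations each run produces rather than to simulation trajectories.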

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could extend to other types of simulation models beyond FMUs if the specification extraction works similarly.
  • Integrating human review steps might improve accuracy of the derived relations.
  • Applying the workflow to additional industrial FMU examples would test its generalizability.

Load-bearing premise

The multi-agent LLM system can accurately extract requirements from specifications and derive valid, meaningful metamorphic relations in Given-When-Then format without substantial human correction.

What would settle it

A demonstration that many of the automatically derived metamorphic relations for the Lube Oil Cooling system FMU are invalid or do not hold when checked against actual simulation behavior would falsify the claim.
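
A check of this kind can be sketched as a violation rate per candidate relation over a grid of source inputs. The sketch below assumes a hypothetical `toy_cooler` function standing in for a trusted simulation, and two made-up candidate relations, one of them deliberately wrong. None of these come from the paper.

```python
from itertools import product


def toy_cooler(oil_in_temp: float, coolant_temp: float, coolant_flow: float) -> float:
    """Hypothetical stand-in for a trusted simulation; returns oil outlet temperature."""
    effectiveness = coolant_flow / (coolant_flow + 1.0)
    return oil_in_temp - effectiveness * (oil_in_temp - coolant_temp)


# Each candidate maps a name to (input transformation, expected output relation).
candidates = {
    "doubling coolant flow does not raise outlet temperature": (
        lambda t_in, t_c, f: (t_in, t_c, 2.0 * f),
        lambda src, fol: fol <= src,
    ),
    # Deliberately invalid relation, to show that the check exposes it.
    "raising inlet temperature lowers outlet temperature": (
        lambda t_in, t_c, f: (t_in + 5.0, t_c, f),
        lambda src, fol: fol < src,
    ),
}

# Source inputs: inlet temperature, coolant temperature, coolant flow.
grid = list(product([60.0, 80.0], [20.0, 30.0], [0.5, 1.0, 2.0]))


def violation_rate(transform, relation) -> float:
    """Fraction of source inputs for which the relation does not hold."""
    violations = sum(
        not relation(toy_cooler(*x), toy_cooler(*transform(*x))) for x in grid
    )
    return violations / len(grid)


rates = {name: violation_rate(*pair) for name, pair in candidates.items()}
```

A relation with a nonzero rate on a trusted model is a candidate for rejection or human review. A zero rate on the sampled grid is evidence that the relation holds, not proof.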

Figures

Figures reproduced from arXiv: 2605.25101 by Abdullah Mughees, Ashir Kulshreshtha, Dragos Truscan, Gaadha Sudheerbabu, Kristian Klemets, Mikael Manngård, Tanwir Ahmad.

Figure 1. Overview of the Multi-Agent workflow. The approach takes as input the simulation model, packed as an FMI-standard compliant FMU, which comprises the functional specification, model description and execution binaries to facilitate the metamorphic testing. In the first step of our approach, we extract system properties to be selected for MR selection from the functional and interface specifications of the sim…
Figure 2. Workflow of Extractor Agent. 2) Extractor Agent: The Extractor Agent operates by transforming the input functional specification document through a structured conversion pipeline, then formulating an output in an LLM-friendly content format (see …
Figure 3. MR generated by the agent in Given-When-Then format.
Abstract

In many industrial domains, the Functional Mock-up Interface (FMI) is used to exchange simulation models as Functional Mock-up Units (FMUs) across different partners using various modelling tools. This opens up the possibilities for simulation-based verification and validation using FMUs for ensuring reliable system behaviour. However, deriving effective test oracles for these simulation models remains challenging due to the absence of explicit expected outputs. This limits the applicability of conventional testing approaches, which require access to the internal workings of the systems. Metamorphic testing (MT) addresses this limitation by leveraging metamorphic relations (MRs), but extracting such relations from specifications remains largely a manual and error-prone process. To address this challenge, we propose an LLM-powered multi-agent workflow for specification-based metamorphic testing of FMU-based simulation models. The approach takes functional and interface specifications as input and orchestrates multiple agents to extract requirements and derive MRs. These MRs are expressed using Given-When-Then patterns to structure input conditions (Given), transformations (When), and expected output behaviours (Then). These relations are then used to generate metamorphic test cases, execute simulations, and evaluate output consistency across multiple sessions. We evaluate the approach on a Lube Oil Cooling system FMU, demonstrating its ability to automatically generate meaningful MRs and corresponding test cases. Preliminary results indicate that the proposed workflow can effectively support the systematic verification and validation of dynamic simulation models by reducing manual effort and improving test generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an LLM-powered multi-agent workflow that takes functional and interface specifications as input, extracts requirements via agents, derives metamorphic relations (MRs) expressed in Given-When-Then format, generates and executes metamorphic test cases on FMU simulations, and checks output consistency. It is demonstrated on a single Lube Oil Cooling system FMU, with the claim that preliminary results show the workflow automatically generates meaningful MRs, reduces manual effort, and supports systematic V&V of dynamic simulation models.

Significance. If the central claim holds under rigorous evaluation, the work would automate a key manual bottleneck in metamorphic testing for black-box FMI/FMU models, which are widely used in industrial co-simulation. The multi-agent orchestration for requirement extraction and GWT-structured MR derivation represents a practical engineering contribution that could be extended to other specification-driven testing domains.

major comments (2)
  1. [Abstract] The claim that the workflow 'automatically generate[s] meaningful MRs' and 'reduc[es] manual effort' rests on unmeasured extraction accuracy; no error rate, fraction of MRs needing revision, inter-rater agreement with human experts, or comparison to manually authored relations is reported, so the effectiveness for systematic V&V cannot be assessed from the single-example preliminary results.
  2. [Evaluation] The Lube Oil Cooling FMU demonstration supplies no quantitative metrics (e.g., fault-detection rate of generated MRs, baseline against manual MRs, or number of test cases executed), nor any validation method for MR soundness against the specification, leaving the improvement-in-test-generation claim unsupported.
minor comments (1)
  1. [Abstract] The phrase 'meaningful MRs' is used without an operational definition or criteria (e.g., executability, specificity, or fault-revealing power) that would allow readers to interpret the preliminary results.
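
One of the metrics named above, the fault-detection rate of a relation, could be measured by running the relation against deliberately faulty variants of a model. The sketch below is a hedged illustration of that idea only: `reference_model`, the three mutants, and the relation are hypothetical, and nothing here reflects an evaluation performed in the paper.

```python
from typing import Callable, Dict

Model = Callable[[float], float]  # coolant flow -> oil outlet temperature


def reference_model(flow: float) -> float:
    """Hypothetical correct model: outlet temperature falls as coolant flow rises."""
    return 80.0 - (flow / (flow + 1.0)) * 50.0


# Hand-written faulty variants (mutants), assumed for illustration.
mutants: Dict[str, Model] = {
    "sign flipped": lambda flow: 80.0 + (flow / (flow + 1.0)) * 50.0,
    "flow ignored": lambda flow: 55.0,
    "inverted effectiveness": lambda flow: 80.0 - (1.0 / (flow + 1.0)) * 50.0,
}

source_flows = [0.5, 1.0, 2.0, 4.0]


def relation_holds(model: Model, flow: float) -> bool:
    """Relation under study: doubling the flow must not raise the outlet temperature."""
    return model(2.0 * flow) <= model(flow)


def is_killed(model: Model) -> bool:
    """A mutant is killed if the relation is violated for at least one source input."""
    return any(not relation_holds(model, flow) for flow in source_flows)


reference_passes = not is_killed(reference_model)
killed = {name for name, model in mutants.items() if is_killed(model)}
fault_detection_rate = len(killed) / len(mutants)
```

In this toy setup the non-strict relation cannot detect the mutant that ignores the flow input, because a constant output still satisfies "does not raise". Gaps of this kind are what a reported fault-detection rate would make visible.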

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical support. We address each major comment below and indicate planned revisions to the manuscript.

  1. Referee: [Abstract] The claim that the workflow 'automatically generate[s] meaningful MRs' and 'reduc[es] manual effort' rests on unmeasured extraction accuracy; no error rate, fraction of MRs needing revision, inter-rater agreement with human experts, or comparison to manually authored relations is reported, so the effectiveness for systematic V&V cannot be assessed from the single-example preliminary results.

    Authors: We agree that the abstract claims regarding automatic generation of meaningful MRs and reduction of manual effort are not supported by quantitative measures such as extraction accuracy, error rates, or comparisons to human-authored relations. The current work presents a preliminary demonstration on a single FMU. We will revise the abstract to qualify these claims, emphasizing the exploratory nature of the results and removing unsupported assertions about effectiveness for systematic V&V.
    Revision: yes

  2. Referee: [Evaluation] The Lube Oil Cooling FMU demonstration supplies no quantitative metrics (e.g., fault-detection rate of generated MRs, baseline against manual MRs, or number of test cases executed), nor any validation method for MR soundness against the specification, leaving the improvement-in-test-generation claim unsupported.

    Authors: The evaluation section is limited to a qualitative demonstration on one FMU without quantitative metrics such as fault-detection rates, baselines against manual MRs, or explicit validation of MR soundness. This accurately reflects the preliminary scope of the paper, which prioritizes workflow description over a controlled empirical study. We will revise the evaluation section to report concrete details including the number of MRs generated, test cases executed, and the manual process used to check consistency with the specification. We will also add an explicit limitations subsection and outline plans for future quantitative evaluation.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: applied workflow paper with no derivations or self-referential predictions


The paper presents an engineering workflow for LLM-based extraction of metamorphic relations from specifications, followed by test generation and execution on an FMU example. No equations, fitted parameters, uniqueness theorems, or predictions appear in the provided text. The central claim is an empirical demonstration on a single Lube Oil Cooling FMU that the workflow 'automatically generate[s] meaningful MRs'; this is a direct report of observed output rather than a derivation that reduces to its own inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing steps. The paper is therefore self-contained as a descriptive applied-methods contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an applied method proposal in software engineering; the central claim rests on the unverified capability of LLMs to perform accurate extraction and relation derivation rather than on any mathematical constructs.

axioms (1)
  • Domain assumption: large language models can reliably interpret functional and interface specifications and generate correct Given-When-Then metamorphic relations.
    The entire multi-agent workflow depends on this capability for the agents to produce usable MRs and test cases.


