Using Large Language Models for Black-Box Testing of FMU-Based Simulations

Abdullah Mughees; Dragos Truscan; Gaadha Sudheerbabu; Kristian Klemets; Mikael Manngård; Tanwir Ahmad

arXiv: 2604.25650 · v1 · submitted 2026-04-28 · cs.SE · cs.SY · eess.SY

Pith reviewed 2026-05-07 15:57 UTC · model grok-4.3

classification: cs.SE, cs.SY, eess.SY

keywords: black-box testing, large language models, FMU, scenario generation, dynamic simulation, test automation, Given-When-Then

The pith

LLMs generate Given-When-Then test scenarios from FMU specifications to automate black-box verification of dynamic simulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a method that feeds the functional and interface specifications of an FMU into an LLM to produce structured scenario goals in Given-When-Then form. Each goal defines starting input conditions, a change to those inputs, and the output behavior that should result. The method then turns the goals into concrete input time series and output assertions, runs the simulation, checks the assertions, and returns readable logs plus plots that show pass rates and per-goal outcomes. A human reviews the generated material before it is stored for reuse. If the method holds, engineers would spend less time hand-crafting test cases while gaining clearer evidence that the simulation behaves as intended under input changes.
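
As a rough illustration of the step from a Given-When-Then goal to a concrete input time series, the sketch below expands one goal into a piecewise-constant series. It is not the paper's implementation: the `ScenarioGoal` fields, the `step_series` helper, and the signal names `heat_load_kw` and `coolant_flow` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioGoal:
    """One Given-When-Then goal (field names are illustrative assumptions)."""
    given: dict          # initial input values
    when: dict           # input values that take effect at change_time
    then: str            # expected output behaviour, stated in words
    change_time: float   # time of the input change, in seconds


def step_series(goal, stop_time, dt):
    """Expand a goal into a piecewise-constant input time series."""
    n_steps = int(round(stop_time / dt))
    rows = []
    for k in range(n_steps + 1):
        t = k * dt
        values = dict(goal.given)
        if t >= goal.change_time:
            values.update(goal.when)   # apply the "When" change
        rows.append({"time": t, **values})
    return rows


# Hypothetical signal names; the source does not list the FMU's variables.
goal = ScenarioGoal(
    given={"heat_load_kw": 100.0, "coolant_flow": 0.5},
    when={"heat_load_kw": 150.0},
    then="oil outlet temperature rises and settles at a higher level",
    change_time=5.0,
)
series = step_series(goal, stop_time=10.0, dt=1.0)
```

Inputs not named in `when` keep their `given` value, so a goal only has to state what changes.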

Core claim

Prompting an LLM with FMU specifications produces valid Given-When-Then scenario goals together with input patterns and assertion oracles. These are assembled into executable plans that generate full input time series, drive the simulation, and evaluate the recorded outputs against the expected patterns. The resulting logs, statistics, and overlaid plots allow direct inspection of whether the model satisfies the stated goals. The process is presented as a practical human-in-the-loop workflow that stores scenarios and results for later re-execution.
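
The execute-and-check step can be sketched with a toy model in place of a real FMU. The example below is an assumption-laden stand-in: a first-order lag integrated with forward Euler plays the role of the simulation, and the two oracle functions are hypothetical examples of what an output assertion might look like. In practice an FMI tool would run the actual FMU.

```python
def simulate_first_order(inputs, dt, tau=2.0, y0=0.0):
    """Toy stand-in for an FMU: y' = (u - y) / tau, forward Euler."""
    y, outputs = y0, []
    for u in inputs:
        outputs.append(y)          # record the output before stepping
        y += dt * (u - y) / tau
    return outputs


def increases_after(outputs, change_index):
    """Oracle: the final output is above the output at the input change."""
    return outputs[-1] > outputs[change_index]


def settles_near(outputs, target, tol, window):
    """Oracle: the last `window` samples stay within `tol` of `target`."""
    return all(abs(y - target) <= tol for y in outputs[-window:])


dt = 0.1
change_index = 100                                # input step at t = 10 s
inputs = [1.0] * change_index + [2.0] * 200       # 30 s of input in total
outputs = simulate_first_order(inputs, dt, y0=1.0)

results = {
    "increases_after_change": increases_after(outputs, change_index),
    "settles_near_2": settles_near(outputs, 2.0, tol=0.05, window=20),
}
```

Keeping each oracle as a small function that returns a boolean makes it straightforward to log per-assertion outcomes and to overlay the expected band on a plot of the recorded output.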

What carries the argument

The LLM-prompting pipeline that converts FMU functional and interface specifications into Given-When-Then goals, derives input patterns and output assertions from those goals, executes the simulation, and produces human-readable result logs and plots.

If this is right

Test scenario creation for dynamic models shifts from fully manual to LLM-assisted with human review.
Verification produces both aggregate pass rates and per-goal visual overlays that make results easier to interpret.
Generated scenarios and their results can be stored and replayed for regression or repeated checks.
The workflow applies to everyday simulation models once the LLM output is inspected for validity.
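
A minimal sketch of the storage and reporting side, assuming results are kept as JSON records with one list of assertion outcomes per goal. The record layout and the `goal_id` key are hypothetical; the source does not describe the storage format. The example only round-trips stored outcomes and does not re-run a simulation.

```python
import json


def pass_rate(outcomes):
    """Fraction of assertions that passed (0.0 for an empty list)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def summarize(records):
    """Compute per-goal and aggregate pass rates from stored records."""
    per_goal = {r["goal_id"]: pass_rate(r["assertions"]) for r in records}
    flat = [ok for r in records for ok in r["assertions"]]
    return {"per_goal": per_goal, "aggregate": pass_rate(flat)}


# Hypothetical record layout: one entry per scenario goal.
records = [
    {"goal_id": "G1", "assertions": [True, True]},
    {"goal_id": "G2", "assertions": [True, False]},
]

stored = json.dumps(records)      # what would be written to disk
reloaded = json.loads(stored)     # what a later run would read back
summary = summarize(reloaded)
```

The aggregate here is computed over individual assertions rather than over goals, which is one of several reasonable definitions; a report should state which one it uses.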

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompting structure could be tried on simulation packages that expose similar interface descriptions even if they are not packaged as FMUs.
Stored scenario sets could serve as a library of regression tests that grows with each new model version.
Prompt variations or different LLM choices could be compared on the same FMU to measure how stable the generated goals remain.
Integration with automated build systems would let the verification step run whenever an FMU is updated.

Load-bearing premise

The LLM must correctly read the FMU specifications and produce goals plus assertions that accurately describe the expected behavior under input changes.

What would settle it

Generate scenarios for an FMU, run them, and observe that many assertions are ill-formed or that important input changes produce no corresponding output checks.
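
One way to make this check mechanical is a small validator that flags scenarios whose assertions refer to unknown signals, or whose input change has no output check at all. The sketch below assumes a dictionary layout for scenarios and hypothetical signal names; neither comes from the paper.

```python
def check_scenario(scenario, input_names, output_names):
    """Return a list of problems found in one scenario (empty if none)."""
    problems = []
    for name in list(scenario["given"]) + list(scenario["when"]):
        if name not in input_names:
            problems.append(f"unknown input: {name}")
    for assertion in scenario["assertions"]:
        if assertion["output"] not in output_names:
            problems.append(f"unknown output: {assertion['output']}")
    if scenario["when"] and not scenario["assertions"]:
        problems.append("input change has no output check")
    return problems


# Hypothetical interface of the model under test.
INPUTS = {"heat_load_kw", "coolant_flow"}
OUTPUTS = {"oil_outlet_temp"}

scenarios = {
    "ok": {
        "given": {"heat_load_kw": 100.0},
        "when": {"heat_load_kw": 150.0},
        "assertions": [{"output": "oil_outlet_temp", "expect": "increases"}],
    },
    "bad_output": {
        "given": {"coolant_flow": 0.5},
        "when": {"coolant_flow": 0.8},
        "assertions": [{"output": "oil_temp", "expect": "decreases"}],
    },
    "unchecked": {
        "given": {"heat_load_kw": 100.0},
        "when": {"heat_load_kw": 50.0},
        "assertions": [],
    },
}

report = {name: check_scenario(s, INPUTS, OUTPUTS) for name, s in scenarios.items()}
```

A check like this only catches structural defects. Whether an assertion describes the right physical behavior still needs human review.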

Figures

Figures reproduced from arXiv: 2604.25650 by Abdullah Mughees, Dragos Truscan, Gaadha Sudheerbabu, Kristian Klemets, Mikael Manngård, Tanwir Ahmad.

**Figure 1.** Overview of the Scenario Generation Approach

**Figure 2.** Mutation Generation Approach. Surrounding text from the paper (Scenario Adequacy): to evaluate the quality of the generated scenarios, the authors apply mutation analysis. Since they do not have access to the internal specification of the FMU, they imitate possible design and implementation mistakes by applying systematic changes (mutations) to the outputs.
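
The idea of mutating outputs to judge scenario adequacy can be sketched as follows. The mutation operators (`offset`, `scale`, `freeze`), the example oracle, and the baseline trace are all assumptions made for illustration; the source does not list the operators the authors use.

```python
def offset(ys, delta):
    """Mutation: add a constant bias to the output."""
    return [y + delta for y in ys]


def scale(ys, factor):
    """Mutation: multiply the output by a constant gain."""
    return [y * factor for y in ys]


def freeze(ys, index):
    """Mutation: hold the output constant from `index` onward."""
    return ys[:index] + [ys[index]] * (len(ys) - index)


def oracle(ys):
    """Example assertion: output rises overall and ends near 2.0."""
    return ys[-1] > ys[0] and abs(ys[-1] - 2.0) <= 0.05


# Hypothetical recorded output of an unmutated run.
baseline = [1.0, 1.0, 1.4, 1.7, 1.9, 2.0, 2.0]

mutants = {
    "offset+0.5": offset(baseline, 0.5),
    "scale*0.8": scale(baseline, 0.8),
    "freeze@1": freeze(baseline, 1),
    "offset+0.01": offset(baseline, 0.01),
}

# A mutant is "killed" when the assertion fails on the mutated output.
killed = {name: not oracle(ys) for name, ys in mutants.items()}
score = sum(killed.values()) / len(killed)
```

In this toy run the small offset survives because it stays inside the oracle's tolerance, which shows how the mutation score depends on how tight the assertions are.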

Original abstract

We propose a human in the loop approach for black-box testing of Functional Mock-up Units (FMUs) using Large Language Models (LLMs). The goal is to reduce the manual effort in defining test scenarios for dynamic simulation models and to improve the interpretability of results. The approach takes the functional and interface specifications of an FMU as input, and prompts an LLM to generate structured scenario goals in Given-When-Then format that define the initial input conditions of the simulation, a possible change in those conditions, and the expected output behaviour of the system against those changes. The corresponding scenario plans specify input patterns and add assertion oracles that describe expected output patterns defined in scenario goals. The approach generates a complete input time series for the scenario plans, runs the FMU simulation, and evaluates assertions on the recorded outputs. It produces human-readable logs and plots that show statistics for each scenario with overlays, aggregate pass rates, and per-goal outcomes. The generated scenarios and results are stored for evaluation and later re-execution. We evaluate the approach on a Lube Oil Cooling system and discuss design choices that make the approach practical for everyday use. Results suggest that LLM-assisted scenario generation can facilitate automatic test design and verification of dynamic simulation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper lays out a clear human-in-the-loop LLM workflow for generating GWT test scenarios and assertions for FMU simulations but gives almost no numbers on how well it actually works.

The paper's main offering is a pipeline that feeds FMU functional and interface specs to an LLM, asks it to produce Given-When-Then goals, turns those into input time series plus assertion oracles, runs the simulation, checks the outputs, and generates plots, logs, and pass-rate summaries. The authors ran it once on a Lube Oil Cooling system and stored everything for reuse. The combination is new for standardized FMU black-box testing, and the emphasis on human-readable results and re-execution is a practical touch for engineers who already work with FMI tools.

Referee Report

2 major / 2 minor

Summary. The paper proposes a human-in-the-loop workflow that uses LLMs to generate Given-When-Then scenario goals and assertions from FMU functional and interface specifications, produces input time series, executes the FMU, evaluates assertions on outputs, and generates logs, plots, and aggregate pass rates. The method is demonstrated on a single Lube Oil Cooling system FMU, with the claim that LLM-assisted scenario generation can facilitate automatic test design and verification of dynamic simulation models.

Significance. If the generated assertions prove reliable with limited human oversight, the approach could reduce the substantial manual effort currently required to define test scenarios and oracles for FMU-based simulations, while improving result interpretability through structured outputs and visualizations. The human-in-the-loop design is a pragmatic strength for engineering practice.

major comments (2)

[Evaluation] Evaluation section: The manuscript reports an evaluation on the Lube Oil Cooling system but provides no quantitative metrics whatsoever—no counts of generated scenarios, no fraction requiring human correction, no accuracy or agreement rate between LLM-generated assertions and intended behavior, no pass/fail statistics beyond the qualitative mention of 'aggregate pass rates,' and no baseline comparison to manual test design. This absence directly undermines the central claim that the approach 'facilitates automatic test design.'
[Approach] Approach and abstract: The core assumption that an LLM can reliably translate FMU specifications into valid Given-When-Then goals and assertions that correctly capture expected dynamic output behavior under input changes is stated but never tested or quantified. Without reported validation (e.g., expert review of oracle correctness or failure-mode analysis), the human-in-the-loop workflow may still demand substantial manual verification, weakening the assertion of reduced effort.

minor comments (2)

[Approach] The paper would benefit from a concrete example (in text or a figure) showing one generated Given-When-Then goal, the corresponding assertion, the input time series, and the resulting plot with pass/fail overlay.
[Abstract] Clarify in the abstract and conclusion what specific observations from the Lube Oil Cooling case study support the suggestion that the method facilitates test design, rather than leaving it as a high-level statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive comments. We believe the suggested revisions will improve the manuscript's clarity and rigor. We address each major comment below.

Referee: [Evaluation] Evaluation section: The manuscript reports an evaluation on the Lube Oil Cooling system but provides no quantitative metrics whatsoever—no counts of generated scenarios, no fraction requiring human correction, no accuracy or agreement rate between LLM-generated assertions and intended behavior, no pass/fail statistics beyond the qualitative mention of 'aggregate pass rates,' and no baseline comparison to manual test design. This absence directly undermines the central claim that the approach 'facilitates automatic test design.'

Authors: We agree that the evaluation lacks quantitative metrics, which limits the strength of our claims regarding reduced manual effort. The current manuscript presents a proof-of-concept demonstration on the Lube Oil Cooling system, with results described qualitatively through logs, plots, and aggregate pass rates. In the revised manuscript, we will expand the evaluation section to include specific counts of generated scenarios, the proportion that required human intervention or correction, accuracy measures based on expert validation of assertions, detailed pass/fail statistics, and, where possible, a comparison to traditional manual test design efforts. This will provide a more robust basis for the claim that the approach facilitates automatic test design.

Revision: yes

Referee: [Approach] Approach and abstract: The core assumption that an LLM can reliably translate FMU specifications into valid Given-When-Then goals and assertions that correctly capture expected dynamic output behavior under input changes is stated but never tested or quantified. Without reported validation (e.g., expert review of oracle correctness or failure-mode analysis), the human-in-the-loop workflow may still demand substantial manual verification, weakening the assertion of reduced effort.

Authors: The human-in-the-loop design is intended to mitigate risks associated with LLM-generated content by incorporating human oversight. However, we acknowledge that the manuscript does not quantify the reliability of the LLM outputs or the extent of human effort required. We will revise the approach description and evaluation to include details on the validation process, such as the number of scenarios reviewed by experts, the rate of corrections needed, and an analysis of failure modes. This will better demonstrate the practical benefits in terms of effort reduction.

Revision: yes

Circularity Check

0 steps flagged

No circularity: procedural workflow with no derivations or self-referential claims

The paper presents a human-in-the-loop engineering workflow for LLM-assisted generation of Given-When-Then scenarios and assertions from FMU specifications, followed by simulation execution and result logging. No mathematical equations, fitted parameters, predictions, or uniqueness theorems appear in the abstract or described method. The central claim is that the approach facilitates test design; this is supported by a descriptive evaluation on one system rather than any derivation that reduces to its own inputs by construction. No self-citations are invoked as load-bearing for any formal result. The method is self-contained as an applied procedure without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied software engineering paper describing a testing methodology. It introduces no new mathematical free parameters, unproven axioms, or postulated entities beyond standard use of LLMs and existing FMU simulation tools.

pith-pipeline@v0.9.0 · 5549 in / 1181 out tokens · 56766 ms · 2026-05-07T15:57:27.833407+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] Modelica, “Functional Mock-up Interface,” https://fmi-standard.org/, 2022.

[2] R. Matinnejad et al., “Test generation and test prioritization for simulink models with dynamic behavior,” IEEE Transactions on Software Engineering, vol. 45, no. 9, pp. 919–944, 2018.

[3] Z. Nan et al., “Test intention guided llm-based unit test generation,” in 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 2025, pp. 779–779.

[4] R. Pan et al., “Aster: Natural and multi-language unit test generation with llms,” in 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 2025, pp. 413–424.

[5] N. Alshahwan et al., “Automated unit test improvement using large language models at Meta,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 185–196.

[6] J. Wang et al., “Software testing with large language models: Survey, landscape, and vision,” IEEE Transactions on Software Engineering, vol. 50, no. 4, pp. 911–936, 2024.

[7] G. Marvin et al., “Prompt engineering in large language models,” in Data Intelligence and Cognitive Informatics, I. J. Jacob, S. Piramuthu, and P. Falkowski-Gilski, Eds. Singapore: Springer Nature Singapore, 2024, pp. 387–402.

[8] J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.

[9] M. Wynne and A. Hellesoy, The cucumber book: behaviour-driven development for testers and developers. Pragmatic Bookshelf, 2012.

[10] G. Sudheerbabu et al., “Validation of dynamic simulation models using metamorphic testing and given-when-then patterns,” in Modelica Conferences, 2025, pp. 139–146.

[11] M. D. Shields and J. Zhang, “The generalization of latin hypercube sampling,” Reliability Engineering & System Safety, vol. 148, pp. 96–108, 2016.

[12] Dassault Systèmes, “FMPy,” https://fmpy.readthedocs.io/en/latest/, 2017.
