Using Large Language Models for Black-Box Testing of FMU-Based Simulations
Pith reviewed 2026-05-07 15:57 UTC · model grok-4.3
The pith
LLMs generate Given-When-Then test scenarios from FMU specifications to automate black-box verification of dynamic simulations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prompting an LLM with FMU specifications produces valid Given-When-Then scenario goals together with input patterns and assertion oracles. These are assembled into executable plans that generate full input time series, drive the simulation, and evaluate the recorded outputs against the expected patterns. The resulting logs, statistics, and overlaid plots allow direct inspection of whether the model satisfies the stated goals. The process is presented as a practical human-in-the-loop workflow that stores scenarios and results for later re-execution.
What carries the argument
An LLM-prompting pipeline converts FMU functional and interface specifications into Given-When-Then goals, derives input patterns and output assertions from those goals, executes the simulation, and produces human-readable result logs and plots.
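The pipeline described above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Scenario` class, the `step` input pattern, the `settles_near` oracle, and the first-order `toy_cooler` model (standing in for a real FMU) are all hypothetical names introduced here, and the goal text is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    """Hypothetical container for one Given-When-Then goal and its plan."""
    given: str
    when: str
    then: str
    input_pattern: Callable[[float], float]              # time -> input value
    oracle: Callable[[List[float], List[float]], bool]   # (times, outputs) -> pass?


def step(before: float, after: float, at: float) -> Callable[[float], float]:
    """Input pattern: hold `before` until time `at`, then switch to `after`."""
    return lambda t: before if t < at else after


def settles_near(target: float, tol: float, after: float):
    """Oracle: every output sample at or after time `after` is within `tol` of `target`."""
    def check(times, outputs):
        tail = [y for t, y in zip(times, outputs) if t >= after]
        return bool(tail) and all(abs(y - target) <= tol for y in tail)
    return check


def toy_cooler(times, inputs, tau=5.0, y0=40.0):
    """Stand-in for an FMU (assumption): first-order lag dy/dt = (u - y) / tau, explicit Euler."""
    y, outputs = y0, []
    for i, u in enumerate(inputs):
        outputs.append(y)
        if i + 1 < len(times):
            y += (times[i + 1] - times[i]) * (u - y) / tau
    return outputs


def run(scenario: Scenario, stop: float = 60.0, dt: float = 0.1) -> bool:
    """Expand the input pattern into a full time series, simulate, and evaluate the oracle."""
    times = [i * dt for i in range(int(round(stop / dt)) + 1)]
    inputs = [scenario.input_pattern(t) for t in times]
    outputs = toy_cooler(times, inputs)
    return scenario.oracle(times, outputs)


scenario = Scenario(
    given="the setpoint is 40 degC and the outlet temperature is steady",
    when="the setpoint steps to 30 degC at t = 10 s",
    then="the outlet temperature stays within 0.5 degC of 30 degC from t = 50 s",
    input_pattern=step(40.0, 30.0, at=10.0),
    oracle=settles_near(30.0, 0.5, after=50.0),
)
passed = run(scenario)
print("PASS" if passed else "FAIL", "-", scenario.then)
```

Keeping the input pattern and the oracle as plain callables means a reviewer can inspect or replace either one without touching the simulation step, which fits the human-in-the-loop framing.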
If this is right
- Test scenario creation for dynamic models shifts from fully manual to LLM-assisted with human review.
- Verification produces both aggregate pass rates and per-goal visual overlays that make results easier to interpret.
- Generated scenarios and their results can be stored and replayed for regression or repeated checks.
- The workflow applies to everyday simulation models once the LLM output is inspected for validity.
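Storing and replaying scenarios, as the list above describes, could look roughly like the following. This is a sketch only: the JSON record layout, the function names, and the two toy models are assumptions made for illustration, and each `simulate` callable stands in for running a real FMU and returning its peak output after the input change.

```python
import json
import tempfile
from pathlib import Path


def save_scenarios(path, scenarios):
    """Write scenario records (plain dicts) to a JSON file."""
    Path(path).write_text(json.dumps(scenarios, indent=2), encoding="utf-8")


def load_scenarios(path):
    """Read scenario records back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def replay(scenarios, simulate):
    """Re-run stored scenarios; return ids whose verdict differs from the stored one."""
    changed = []
    for record in scenarios:
        verdict = simulate(record["input"]) <= record["max_output"]
        if verdict != record["passed"]:
            changed.append(record["id"])
    return changed


# Hypothetical stored record: a step input and an upper bound on the peak output.
stored = [{
    "id": "G1-S1",
    "goal": "Given a 40 degC setpoint, when it steps to 30 degC, "
            "then the outlet peak stays at or below 35 degC",
    "input": {"before": 40.0, "after": 30.0, "at": 10.0},
    "max_output": 35.0,
    "passed": True,
}]


def model_v1(inp):
    """Toy stand-in for the original model version: peak output after the step."""
    return 0.8 * inp["after"]


def model_v2(inp):
    """Toy stand-in for an updated model version whose peak is higher."""
    return 1.2 * inp["after"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scenarios.json"
    save_scenarios(path, stored)
    reloaded = load_scenarios(path)

unchanged = replay(reloaded, model_v1)   # verdicts match the stored results
regressed = replay(reloaded, model_v2)   # verdict flips for the updated model
print("changed verdicts:", regressed)
```

Recording the verdict next to each scenario is what turns a stored scenario set into a regression check: a replay only has to report the ids whose outcome changed.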
Where Pith is reading between the lines
- The same prompting structure could be tried on simulation packages that expose similar interface descriptions even if they are not packaged as FMUs.
- Stored scenario sets could serve as a regression-test library that grows with each new model version.
- Prompt variations or different LLM choices could be compared on the same FMU to measure how stable the generated goals remain.
- Integration with automated build systems would let the verification step run whenever an FMU is updated.
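One way to measure how stable the generated goals remain across prompt variants or models is to reduce each goal to a comparable key and compute set overlap between runs. The sketch below uses mean pairwise Jaccard similarity; the choice of metric, the (input, output, expected direction) key, and the variable names are assumptions made here, not something the paper specifies.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs) -> float:
    """Average Jaccard similarity over all pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical goal keys: (input variable, output variable, expected direction).
x = ("coolant_flow", "oil_temperature", "decrease")
y = ("pump_speed", "oil_pressure", "increase")
z = ("ambient_temperature", "oil_temperature", "increase")

run_a = {x, y}        # goals from prompt variant A
run_b = {x, y, z}     # goals from prompt variant B
run_c = {x, z}        # goals from prompt variant C

stability = mean_pairwise_jaccard([run_a, run_b, run_c])
print(f"mean pairwise Jaccard: {stability:.3f}")
```

A value near 1 would indicate that different prompts or models produce largely the same goals for the same FMU; a low value would suggest the generated test set depends heavily on prompt wording.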
Load-bearing premise
The LLM must correctly read the FMU specifications and produce goals plus assertions that accurately describe the expected behavior under input changes.
What would settle it
Generate scenarios for an FMU and run them. The claim fails if many assertions turn out to be ill-formed, or if important input changes produce no corresponding output checks.
Original abstract
We propose a human-in-the-loop approach for black-box testing of Functional Mock-up Units (FMUs) using Large Language Models (LLMs). The goal is to reduce the manual effort in defining test scenarios for dynamic simulation models and to improve the interpretability of results. The approach takes the functional and interface specifications of an FMU as input, and prompts an LLM to generate structured scenario goals in Given-When-Then format that define the initial input conditions of the simulation, a possible change in those conditions, and the expected output behaviour of the system in response to those changes. The corresponding scenario plans specify input patterns and add assertion oracles that describe the expected output patterns defined in the scenario goals. The approach generates a complete input time series for the scenario plans, runs the FMU simulation, and evaluates assertions on the recorded outputs. It produces human-readable logs and plots that show statistics for each scenario with overlays, aggregate pass rates, and per-goal outcomes. The generated scenarios and results are stored for evaluation and later re-execution. We evaluate the approach on a Lube Oil Cooling system and discuss design choices that make the approach practical for everyday use. Results suggest that LLM-assisted scenario generation can facilitate automatic test design and verification of dynamic simulation models.
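The aggregate pass rates and per-goal outcomes mentioned in the abstract can be computed from a flat list of assertion results. The following is a small sketch under assumptions: the `(goal_id, scenario_id, passed)` tuple layout, the `summarize` function, and the sample results are invented for illustration and do not come from the paper.

```python
from collections import defaultdict


def summarize(results):
    """Aggregate (goal_id, scenario_id, passed) tuples into overall and per-goal counts."""
    per_goal = defaultdict(lambda: {"passed": 0, "total": 0})
    for goal_id, _scenario_id, passed in results:
        per_goal[goal_id]["total"] += 1
        if passed:
            per_goal[goal_id]["passed"] += 1
    total = len(results)
    passed_total = sum(counts["passed"] for counts in per_goal.values())
    return {
        "pass_rate": passed_total / total if total else 0.0,
        "per_goal": dict(per_goal),
    }


# Made-up results: two goals, two scenarios each.
results = [
    ("G1", "S1", True),
    ("G1", "S2", False),
    ("G2", "S3", True),
    ("G2", "S4", True),
]

summary = summarize(results)
print(f"aggregate pass rate: {summary['pass_rate']:.0%}")
for goal_id, counts in sorted(summary["per_goal"].items()):
    print(f"  {goal_id}: {counts['passed']}/{counts['total']} scenarios passed")
```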
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a human-in-the-loop workflow that uses LLMs to generate Given-When-Then scenario goals and assertions from FMU functional and interface specifications, produces input time series, executes the FMU, evaluates assertions on outputs, and generates logs, plots, and aggregate pass rates. The method is demonstrated on a single Lube Oil Cooling system FMU, with the claim that LLM-assisted scenario generation can facilitate automatic test design and verification of dynamic simulation models.
Significance. If the generated assertions prove reliable with limited human oversight, the approach could reduce the substantial manual effort currently required to define test scenarios and oracles for FMU-based simulations, while improving result interpretability through structured outputs and visualizations. The human-in-the-loop design is a pragmatic strength for engineering practice.
major comments (2)
- [Evaluation] Evaluation section: The manuscript reports an evaluation on the Lube Oil Cooling system but provides no quantitative metrics whatsoever: no counts of generated scenarios, no fraction requiring human correction, no accuracy or agreement rate between LLM-generated assertions and intended behavior, no pass/fail statistics beyond the qualitative mention of 'aggregate pass rates,' and no baseline comparison to manual test design. This absence directly undermines the central claim that the approach 'facilitates automatic test design.'
- [Approach] Approach and abstract: The core assumption that an LLM can reliably translate FMU specifications into valid Given-When-Then goals and assertions that correctly capture expected dynamic output behavior under input changes is stated but never tested or quantified. Without reported validation (e.g., expert review of oracle correctness or failure-mode analysis), the human-in-the-loop workflow may still demand substantial manual verification, weakening the assertion of reduced effort.
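Two of the quantities the major comments ask for, the fraction of scenarios requiring human correction and the agreement between LLM-generated oracles and expert judgement, are straightforward to compute once review data exists. The sketch below uses Cohen's kappa as the agreement measure; that choice, the function names, and the review data are assumptions for illustration only and are not results from the paper.

```python
from collections import Counter


def correction_rate(corrected):
    """Fraction of generated scenarios that a reviewer had to correct."""
    if not corrected:
        raise ValueError("no scenarios reviewed")
    return sum(corrected) / len(corrected)


def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label]
                   for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:          # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up review data for eight scenarios.
llm_verdict    = [True, True, True,  False, True, False, True, True]   # LLM oracle: pass?
expert_verdict = [True, True, False, False, True, False, True, False]  # expert: pass?
was_corrected  = [False, False, True, False, False, False, False, True]

kappa = cohen_kappa(llm_verdict, expert_verdict)
rate = correction_rate(was_corrected)
print(f"oracle/expert agreement (kappa): {kappa:.2f}")
print(f"scenarios needing correction: {rate:.0%}")
```

Kappa is used here rather than raw agreement because pass verdicts may dominate, and raw agreement can look high by chance alone in that case.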
minor comments (2)
- [Approach] The paper would benefit from a concrete example (in text or a figure) showing one generated Given-When-Then goal, the corresponding assertion, the input time series, and the resulting plot with pass/fail overlay.
- [Abstract] Clarify in the abstract and conclusion what specific observations from the Lube Oil Cooling case study support the suggestion that the method facilitates test design, rather than leaving it as a high-level statement.
Simulated Author's Rebuttal
Thank you for the constructive comments. We believe the suggested revisions will improve the manuscript's clarity and rigor. We address each major comment below.
- Referee: [Evaluation] Evaluation section: The manuscript reports an evaluation on the Lube Oil Cooling system but provides no quantitative metrics whatsoever: no counts of generated scenarios, no fraction requiring human correction, no accuracy or agreement rate between LLM-generated assertions and intended behavior, no pass/fail statistics beyond the qualitative mention of 'aggregate pass rates,' and no baseline comparison to manual test design. This absence directly undermines the central claim that the approach 'facilitates automatic test design.'
  Authors: We agree that the evaluation lacks quantitative metrics, which limits the strength of our claims regarding reduced manual effort. The current manuscript presents a proof-of-concept demonstration on the Lube Oil Cooling system, with results described qualitatively through logs, plots, and aggregate pass rates. In the revised manuscript, we will expand the evaluation section to include specific counts of generated scenarios, the proportion that required human intervention or correction, accuracy measures based on expert validation of assertions, detailed pass/fail statistics, and, where possible, a comparison to traditional manual test design efforts. This will provide a more robust basis for the claim that the approach facilitates automatic test design.
  Revision: yes
- Referee: [Approach] Approach and abstract: The core assumption that an LLM can reliably translate FMU specifications into valid Given-When-Then goals and assertions that correctly capture expected dynamic output behavior under input changes is stated but never tested or quantified. Without reported validation (e.g., expert review of oracle correctness or failure-mode analysis), the human-in-the-loop workflow may still demand substantial manual verification, weakening the assertion of reduced effort.
  Authors: The human-in-the-loop design is intended to mitigate risks associated with LLM-generated content by incorporating human oversight. However, we acknowledge that the manuscript does not quantify the reliability of the LLM outputs or the extent of human effort required. We will revise the approach description and evaluation to include details on the validation process, such as the number of scenarios reviewed by experts, the rate of corrections needed, and an analysis of failure modes. This will better demonstrate the practical benefits in terms of effort reduction.
  Revision: yes
Circularity Check
No circularity: procedural workflow with no derivations or self-referential claims
The paper presents a human-in-the-loop engineering workflow for LLM-assisted generation of Given-When-Then scenarios and assertions from FMU specifications, followed by simulation execution and result logging. No mathematical equations, fitted parameters, predictions, or uniqueness theorems appear in the abstract or described method. The central claim is that the approach facilitates test design; this is supported by a descriptive evaluation on one system rather than any derivation that reduces to its own inputs by construction. No self-citations are invoked as load-bearing for any formal result. The method is self-contained as an applied procedure without circular reduction.