Automated Test Validators for Flaky Cyber-Physical System Simulators: Approach and Evaluation

Baharin A. Jodat; Khouloud Gaaloul; Mehrdad Sabetzadeh; Shiva Nejati

arxiv: 2508.20902 · v3 · submitted 2025-08-28 · 💻 cs.SE

Pith reviewed 2026-05-18 20:51 UTC · model grok-4.3

classification 💻 cs.SE

keywords: cyber-physical systems, test validators, genetic programming, spectrum-based fault localization, flaky simulators, simulation-based testing, Ochiai formula, precondition violations

The pith

Genetic programming using the Ochiai formula produces more accurate test validators for filtering ineffective inputs in flaky cyber-physical system simulators than genetic programming with Tarantula or Naish, decision trees, or decision rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops automatic test validators that skip ineffective inputs before running costly CPS simulators, which may also produce inconsistent results due to flakiness. Two generation methods are compared: genetic programming that treats spectrum-based fault localization (SBFL) formulas as fitness functions, and decision trees or decision rules. The validators target precondition violations, operational design domain (ODD) limit violations, and inherently safe scenarios. Across aerospace, networking, and autonomous driving case studies, the genetic programming variant with Ochiai shows higher accuracy than the alternatives, and this edge holds even after accounting for simulator flakiness. The validators are also shown to be robust, with low accuracy variation, and most of their assertions match requirements drawn from standards and empirical literature.

Core claim

Test validators generated using genetic programming with the Ochiai spectrum-based fault localization formula are significantly more accurate than those generated using genetic programming with Tarantula or Naish, or using decision trees or decision rules. This accuracy advantage remains even when accounting for the flakiness of the simulator. The validators are robust against flakiness, showing only 4 percent average variation in accuracy across four network and autonomous-driving systems with flaky behaviour. On average, 88.7 percent of the assertions inferred by the approach align or overlap with precondition violations, ODD-limit violations, and nominal safe conditions extracted from technical standards and empirical results in the literature.

What carries the argument

Genetic programming that uses spectrum-based fault localization ranking formulas, especially Ochiai, as fitness functions to evolve boolean expressions classifying test inputs as valid or invalid for simulator execution.
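
For orientation, the three ranking formulas are usually written as follows in their original statement-coverage setting. Here e_f and e_p count the failing and passing tests that execute a program element, and n_f and n_p count the failing and passing tests that do not. This is the standard fault-localization notation, not necessarily the paper's own. The Naish entry shows the O^p variant from Naish et al., which is an assumption about the variant meant. In the validator setting, "executes the element" would presumably become "satisfies the candidate boolean expression".

```latex
\begin{align*}
\text{Ochiai} &= \frac{e_f}{\sqrt{(e_f + n_f)\,(e_f + e_p)}} \\
\text{Tarantula} &= \frac{\dfrac{e_f}{e_f + n_f}}{\dfrac{e_f}{e_f + n_f} + \dfrac{e_p}{e_p + n_p}} \\
\text{Naish } (O^p) &= e_f - \frac{e_p}{e_p + n_p + 1}
\end{align*}
```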

If this is right

Validators can pre-filter test inputs that violate preconditions or exceed ODD limits, avoiding unnecessary simulator runs.
The accuracy advantage persists despite inconsistent outcomes caused by simulator flakiness.
Generated assertions align closely with technical standards and empirical results from the literature.
Robustness is demonstrated with only 4 percent average accuracy variation across multiple flaky systems.
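
The pre-filtering consequence listed above can be sketched as a partition step that runs before any simulation. This is a minimal illustration, not the paper's tooling: the function `split_by_validator`, the input fields `speed_kmh` and `gap_m`, and the threshold values are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a test input is a dict of named numeric parameters.
TestInput = Dict[str, float]


def split_by_validator(
    inputs: List[TestInput],
    is_ineffective: Callable[[TestInput], bool],
) -> Tuple[List[TestInput], List[TestInput]]:
    """Partition inputs into (to_run, skipped) without invoking the simulator."""
    to_run: List[TestInput] = []
    skipped: List[TestInput] = []
    for test_input in inputs:
        if is_ineffective(test_input):
            skipped.append(test_input)
        else:
            to_run.append(test_input)
    return to_run, skipped


def example_validator(t: TestInput) -> bool:
    """Made-up validator: flags an assumed ODD speed limit or a negative gap."""
    return t["speed_kmh"] > 120.0 or t["gap_m"] < 0.0


candidates = [
    {"speed_kmh": 60.0, "gap_m": 25.0},
    {"speed_kmh": 150.0, "gap_m": 25.0},  # exceeds the assumed ODD limit
    {"speed_kmh": 80.0, "gap_m": -1.0},   # violates the assumed precondition
    {"speed_kmh": 100.0, "gap_m": 5.0},
]
to_run, skipped = split_by_validator(candidates, example_validator)
saved_fraction = len(skipped) / len(candidates)
print(f"run {len(to_run)} inputs, skip {len(skipped)} ({saved_fraction:.0%} saved)")
```

Only the inputs in `to_run` would be passed to the simulator. With a flaky simulator that needs several reruns per input, each skipped input saves that many executions.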

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The filtering step could be inserted early in automated testing pipelines to reduce overall simulation time for large input spaces.
Similar repurposing of fault localization formulas might help other simulation-based domains that face high execution costs.
If the validators prove stable, they could enable broader sampling of critical scenarios without proportional growth in compute demand.

Load-bearing premise

Spectrum-based fault localization ranking formulas such as Ochiai can be repurposed as effective fitness functions inside genetic programming to evolve validators that correctly identify precondition violations, ODD-limit violations, and inherently safe scenarios without needing to execute the simulator.

What would settle it

A new case study on a different CPS domain where the accuracy of GP with Ochiai is not significantly higher than GP with Tarantula, Naish, or decision trees would falsify the central accuracy claim.

Figures

Figures reproduced from arXiv: 2508.20902 by Baharin A. Jodat, Khouloud Gaaloul, Mehrdad Sabetzadeh, Shiva Nejati.

**Figure 1.** Workflow for verdict assignment using (a) test oracles based on system execution, (b) assertion-based test oracles which

**Figure 2.** An assertion-based test oracle for a simplified ADS with inputs

**Figure 4.** Our approach for deriving assertion-based test oracles

**Figure 5.** Syntactic rules of the grammar (denoted by

**Figure 6.** Fitness functions for our GP-based condition inference

**Figure 7.** Computing c_p(c) and c_f(c) in SBFL fitness functions in

**Figure 8.** Pruning inconsistent assertions from test oracles using a

**Figure 9.** (a) An example set of test inputs for a (signal-based) autopilot system, with input signals

**Figure 10.** Step-by-step illustration of using rules to derive logical assertion conditions over the signals in Figure 9 from assertion

**Figure 6.** Naish denoted by GPN, Tarantula denoted by GPT, and Ochiai denoted by GPO. To account for the randomness of GP, DT, and DR, we apply each technique 20 times to the training set for each case study. In addition to considering the test oracle generation methods individually, we also consider an ensemble approach. Specifically, for each run of GPN, GPT, GPO, DT, and DR, the ensemble method computes the uni…

**Figure 11.** Illustrations of the percentage of unique correct

**Figure 13.** Pass-as-Fail rates of the test oracles generated by

**Figure 15.** Average relative accuracies of the test oracles gener

**Figure 16.** Trade-off between relative accuracy and inconclusive

Original abstract

Simulation-based testing of cyber-physical systems (CPS) is costly due to the time-consuming execution of CPS simulators. In addition, CPS simulators may be flaky, leading to inconsistent test outcomes and requiring repeated test re-execution for reliable test verdicts. Many test inputs within the input space of CPS may not effectively exercise the behaviour of the system under test (SUT) -- for instance, those that violate system preconditions, exceed operational design domain (ODD) limits, or represent inherently safe scenarios. In this article, we propose to use test validators to filter out such test inputs before execution. We describe two methods for generating test validators: one using genetic programming (GP) that employs well-known spectrum-based fault localization (SBFL) ranking formulas, namely Ochiai, Tarantula, and Naish, as fitness functions; and the other using decision trees (DT) and decision rules (DR). We evaluate our test validators through case studies in the domains of aerospace, networking and autonomous driving. We show that test validators generated using GP with Ochiai are significantly more accurate than those generated using GP with Tarantula and Naish or using DT or DR. Moreover, this accuracy advantage remains even when accounting for the flakiness of the simulator. We further show that our test validators generated by GP with Ochiai are robust against flakiness with only 4% average variation in their accuracy results across four different network and autonomous-driving systems with flaky behaviours. Finally, we show that, on average, 88.7% of the assertions inferred by our approach align or overlap with requirements precondition violations, ODD-limit violations, and nominal safe conditions extracted from technical standards and empirical results in the literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adapts SBFL formulas like Ochiai as GP fitness functions to generate input validators for flaky CPS simulators, with reported accuracy and robustness gains across four systems.

The main takeaway is that genetic programming guided by Ochiai as a fitness function produces more accurate validators than genetic programming with Tarantula or Naish, decision trees, or decision rules for filtering precondition violations, ODD-limit cases, and safe scenarios in CPS simulators. The accuracy edge holds up under flakiness, with only 4% average variation reported across the four flaky systems evaluated, and most assertions align with requirements from standards and the literature. This is a direct attempt to cut simulation costs by skipping ineffective inputs upfront.

The evaluation on aerospace, networking, and autonomous driving cases gives it some practical grounding, and the robustness check against flakiness is a useful addition for real-world use. Judging from the abstract, the combination of GP with these particular SBFL formulas for validator synthesis is not a straight lift from prior work.

On the soft spots, the abstract claims a significant accuracy advantage but gives no statistical tests, effect sizes, or full protocol details, which leaves the strength of the superiority claim hard to judge from the summary alone. The mapping from SBFL spectra to validator predicates is worth checking in the methods section; if the paper does not give an explicit construction of the a/b/c/d counts for this new context, the Ochiai advantage could be tied to the specific encoding rather than a clean transfer of the formulas. Reproducibility would benefit from data or code availability.

This is for people working on simulation-based testing and input reduction in cyber-physical systems, especially those facing flaky simulators. A reader focused on automated testing for embedded or autonomous systems would find concrete ideas here. It deserves peer review to get feedback on the fitness mapping and analysis details.

Referee Report

3 major / 2 minor

Summary. The paper proposes automated generation of test validators to filter ineffective inputs (precondition violations, ODD-limit violations, inherently safe scenarios) for flaky CPS simulators, thereby reducing execution costs. Two generation methods are described: genetic programming (GP) that repurposes spectrum-based fault localization (SBFL) formulas (Ochiai, Tarantula, Naish) as fitness functions, and decision trees/rules (DT/DR). Evaluation on aerospace, networking, and autonomous-driving case studies claims that GP with Ochiai yields significantly higher accuracy than the alternatives, that this advantage persists under simulator flakiness, that accuracy varies only 4% on average across four flaky systems, and that 88.7% of inferred assertions align with literature-derived requirements.

Significance. If the central empirical claims hold after rigorous statistical validation and clearer method exposition, the work could meaningfully lower the cost of simulation-based CPS testing in safety-critical domains. The explicit handling of flakiness and the reported alignment with external standards are practical strengths. The approach also offers a novel transfer of SBFL techniques into test-input filtering, which could be extended if the mapping from coverage spectra to validator fitness is shown to be general rather than artifactual.

major comments (3)

[Abstract and §5 (Evaluation)] The repeated claim that GP-Ochiai validators are 'significantly more accurate' is unsupported by any statistical test, confidence interval, effect size, or raw-data summary. The reported accuracy advantage and the 4% flakiness-variation figure therefore remain descriptive rather than inferential, weakening the central comparative result.
[§3.2 (GP fitness function definition)] The mapping from SBFL spectra (counts a, b, c, d) to a fitness function over candidate validator predicates or features is not explicitly constructed. Without this definition it is unclear why Ochiai, Tarantula, or Naish remain meaningful outside their original code-coverage setting or whether the reported superiority is an artifact of the particular feature encoding and labeling scheme.
[§4 and §5] The experimental protocol (number of GP runs, population size, termination criteria, how flakiness is injected and measured, train/test split for validator accuracy) is not fully specified, preventing independent reproduction or assessment of robustness claims.

minor comments (2)

[Figures 4-7] Captions and axis labels in the accuracy and robustness plots should explicitly state the number of independent runs and the exact accuracy metric (e.g., precision, recall, F1) used.
[§5.3] The 88.7% alignment figure would benefit from a breakdown by domain and by type of violation (precondition vs. ODD vs. safe scenario) to show whether the result is uniform.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating where revisions will strengthen the manuscript and where we provide additional clarification.

Referee: [Abstract and §5] the repeated claim that GP-Ochiai validators are 'significantly more accurate' is unsupported by any statistical test, confidence interval, effect size, or raw-data summary. The reported accuracy advantage and the 4% flakiness-variation figure therefore remain descriptive rather than inferential.

Authors: We agree that the term 'significantly' was used in a descriptive sense in the current draft. In the revised version we will replace this with inferential statistics: we will report results from 30 independent GP runs, apply the Wilcoxon signed-rank test with p-values, compute Cohen's d effect sizes, and include 95% confidence intervals for the accuracy differences. Raw per-run accuracy tables will be added to an appendix or supplementary material. The 4% variation figure will similarly be accompanied by standard deviation and range across the four systems.

Revision: yes

Referee: [§3.2] the mapping from SBFL spectra (counts a, b, c, d) to a fitness function over candidate validator predicates or features is not explicitly constructed. Without this definition it is unclear why Ochiai, Tarantula, or Naish remain meaningful outside their original code-coverage setting.

Authors: The fitness function is obtained by treating each candidate validator predicate as a binary classifier over the set of executed test inputs: a = number of effective inputs where the predicate evaluates true, b = number of ineffective inputs where it evaluates true, c = number of effective inputs where it evaluates false, d = number of ineffective inputs where it evaluates false. The SBFL formula is then applied directly to these four counts to produce the fitness value. We will insert an explicit equation and a short paragraph in §3.2 that defines this mapping and explains why the formulas remain semantically meaningful when the 'spectrum' is derived from input-effectiveness labels rather than statement coverage.

Revision: yes

Referee: [§4 and §5] the experimental protocol (number of GP runs, population size, termination criteria, how flakiness is injected and measured, train/test split for validator accuracy) is not fully specified, preventing independent reproduction or assessment of robustness claims.

Authors: We acknowledge that several parameter values and procedural steps were described at a high level. In the revision we will expand both sections with the following concrete details: 30 independent GP runs per configuration, population size of 100, tournament selection, 100-generation limit or fitness convergence of 0.01, flakiness injection via Gaussian noise on simulator outputs with variance calibrated to observed real-world flakiness rates, accuracy measured on a held-out 30% test set after training on 70%, and explicit random-seed reporting. A new subsection will tabulate all hyperparameters.

Revision: yes
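
The four-count mapping given in the second response above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `spectrum` and `ochiai`, the `speed` field, and the sample data are hypothetical, and the sketch assumes that a, b, and c take the roles that e_f, e_p, and n_f play in the usual Ochiai formula.

```python
from math import sqrt
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical types: a test input is a dict of named numeric parameters, and a
# labelled sample pairs an input with a flag saying whether it was effective.
TestInput = Dict[str, float]
Predicate = Callable[[TestInput], bool]


def spectrum(
    pred: Predicate, samples: Iterable[Tuple[TestInput, bool]]
) -> Tuple[int, int, int, int]:
    """Count (a, b, c, d) as defined in the response above."""
    a = b = c = d = 0
    for test_input, effective in samples:
        holds = pred(test_input)
        if holds and effective:
            a += 1  # effective input, predicate true
        elif holds:
            b += 1  # ineffective input, predicate true
        elif effective:
            c += 1  # effective input, predicate false
        else:
            d += 1  # ineffective input, predicate false
    return a, b, c, d


def ochiai(a: int, b: int, c: int) -> float:
    """Ochiai score, assuming a, b, c stand in for e_f, e_p, n_f."""
    denom = sqrt((a + c) * (a + b))
    return a / denom if denom else 0.0


# Made-up labelled samples and one candidate predicate.
samples = [
    ({"speed": 10.0}, True),
    ({"speed": 20.0}, True),
    ({"speed": 90.0}, False),
    ({"speed": 95.0}, False),
    ({"speed": 15.0}, False),
]
candidate: Predicate = lambda t: t["speed"] < 50.0
counts = spectrum(candidate, samples)
fitness = ochiai(*counts[:3])
print(counts, round(fitness, 4))
```

In a GP loop, `fitness` would be computed for every candidate expression in the population. Ochiai does not use the count d, which is why only the first three counts are passed in.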

Circularity Check

0 steps flagged

No circularity in empirical evaluation of GP-based test validators

The paper's claims rest on comparative experiments across aerospace, networking, and autonomous-driving case studies, measuring validator accuracy against ground-truth labels for precondition/ODD/safe-scenario violations and checking robustness to simulator flakiness. SBFL formulas are adopted as GP fitness functions via an explicit methodological choice, with performance differences reported empirically rather than derived by construction from the evaluation data itself. Alignment with literature requirements (88.7% overlap) serves as an external validation step, not a definitional input. No equations, self-citations, or renamings reduce the reported accuracy advantages or robustness figures to tautological re-expressions of the same fitted quantities or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard domain assumptions about CPS testing and simulator behavior rather than new free parameters or invented entities.

axioms (2)

Domain assumption: CPS simulators can produce inconsistent outcomes for the same input due to flakiness, requiring repeated executions for reliable verdicts.
Explicitly stated as a core motivation in the abstract.
Domain assumption: A substantial fraction of test inputs violate preconditions, exceed ODD limits, or represent inherently safe scenarios and can therefore be filtered without execution.
Central premise justifying the use of validators.
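
The first assumption above, that a flaky simulator needs repeated executions before a verdict can be trusted, can be illustrated with a small majority-vote sketch. The aggregation rule, the function `majority_verdict`, and the stand-in `flaky_run` are all assumptions made for illustration; the source does not say how the paper combines repeated runs.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Tuple


def majority_verdict(run_once: Callable[[], str], reruns: int) -> Tuple[str, float]:
    """Run the test `reruns` times; return the most common verdict and its share."""
    verdicts = Counter(run_once() for _ in range(reruns))
    verdict, votes = verdicts.most_common(1)[0]
    return verdict, votes / reruns


# Stand-in for a flaky simulator: replays a fixed, inconsistent verdict sequence
# for one and the same test input (hypothetical data).
_outcomes = cycle(["pass", "fail", "pass", "pass", "fail"])


def flaky_run() -> str:
    return next(_outcomes)


verdict, agreement = majority_verdict(flaky_run, reruns=5)
print(verdict, agreement)
```

Every rerun is a full simulator execution, so an input that a validator rules out beforehand saves all of them.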

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Grammar-Constrained Refinement of Safety Operational Rules Using Language in the Loop: What Could Go Wrong
cs.SE 2026-04 unverdicted novelty 5.0

A grammar-constrained counterfactual refinement framework resolves inconsistencies in safety operational rules for an autonomous driving system while staying syntactically valid.

Reference graph

Works this paper leans on

110 extracted references · 110 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

[Online]

(Accessed: September 2025) Raquel urtasun’s tech company develops self-driving vehicle simulator. [Online]. Available: https://www.thestar. com/business/raquel-urtasun-s-tech-company-develops-self-driving-v ehicle-simulator/article_4fc552f3-cbec-523c-ad3a-ec6aa93cdad7.html

work page 2025
[2]

Machine learning-based test selection for simulation-based testing of self-driving cars software,

C. Birchler, S. Khatiri, B. Bosshard, A. Gambi, and S. Panichella, “Machine learning-based test selection for simulation-based testing of self-driving cars software,” Empirical Software Engineering , vol. 28, no. 3, p. 71, 2023

work page 2023
[3]

Salvo: Automated generation of diversified tests for self-driving cars from existing maps,

V . Nguyen, S. Huber, and A. Gambi, “Salvo: Automated generation of diversified tests for self-driving cars from existing maps,” in2021 IEEE International Conference on Artificial Intelligence Testing (AITest) . IEEE, 2021, pp. 128–135

work page 2021
[4]

Simulation-based testing of unmanned aerial vehicles with aerialist,

S. Khatiri, S. Panichella, and P. Tonella, “Simulation-based testing of unmanned aerial vehicles with aerialist,” in Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, 2024, pp. 134–138

work page 2024
[5]

An empirical analysis of flaky tests,

Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, “An empirical analysis of flaky tests,” inProceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, 2014, pp. 643–653

work page 2014
[6]

A survey of flaky tests,

O. Parry, G. M. Kapfhammer, M. Hilton, and P. McMinn, “A survey of flaky tests,” ACM Transactions on Software Engineering and Method- ology (TOSEM), vol. 31, no. 1, pp. 1–74, 2021

work page 2021
[7]

Constructing automated test oracle for low observable software,

M. Valueian, N. Attar, H. Haghighi, and M. Vahidi-Asl, “Constructing automated test oracle for low observable software,” Scientia Iranica , vol. 27, no. 3, pp. 1333–1351, 2020

work page 2020
[8]

Using a neural network in the software testing process,

M. Vanmali, M. Last, and A. Kandel, “Using a neural network in the software testing process,” International Journal of Intelligent Systems , vol. 17, no. 1, pp. 45–62, 2002

work page 2002
[9]

An automated framework for software test oracle,

S. R. Shahamiri, W. M. N. W. Kadir, S. Ibrahim, and S. Z. M. Hashim, “An automated framework for software test oracle,” Information and Software Technology, vol. 53, no. 7, pp. 774–788, 2011. 23

work page 2011
[10]

Artificial neural networks as multi-networks automated test oracle,

S. R. Shahamiri, W. M. Wan-Kadir, S. Ibrahim, and S. Z. M. Hashim, “Artificial neural networks as multi-networks automated test oracle,” Automated Software Engineering , vol. 19, pp. 303–334, 2012

work page 2012
[11]

An approach to design test oracle for aspect oriented software systems using soft computing approach,

A. Singhal, A. Bansal, and A. Kumar, “An approach to design test oracle for aspect oriented software systems using soft computing approach,” International Journal of System Assurance Engineering and Management, vol. 7, pp. 1–5, 2016

work page 2016
[12]

A classifier-based test oracle for embedded software,

F. Gholami, N. Attar, H. Haghighi, M. V . Asl, M. Valueian, and S. Mo- hamadyari, “A classifier-based test oracle for embedded software,” in 2018 Real-Time and Embedded Systems and Technologies (RTEST) . IEEE, 2018, pp. 104–111

work page 2018
[13]

A machine learning approach to generate test oracles,

R. Braga, P. S. Neto, R. Rabêlo, J. Santiago, and M. Souza, “A machine learning approach to generate test oracles,” in Proceedings of the XXXII Brazilian Symposium on Software Engineering , 2018, pp. 142–151

work page 2018
[14]

Human-in-the-loop automatic program repair,

C. Geethal, M. Böhme, and V .-T. Pham, “Human-in-the-loop automatic program repair,” IEEE Transactions on Software Engineering , 2023

work page 2023
[15]

On the accuracy of spectrum-based fault localization,

R. Abreu, P. Zoeteweij, and A. J. Van Gemund, “On the accuracy of spectrum-based fault localization,” in Testing: Academic and industrial conference practice and research techniques-MUTATION (TAICPART- MUTATION 2007). IEEE, 2007, pp. 89–98

work page 2007
[16]

Empirical evaluation of the tarantula automatic fault-localization technique,

J. A. Jones and M. J. Harrold, “Empirical evaluation of the tarantula automatic fault-localization technique,” in Proceedings of the 20th IEEE/ACM international Conference on Automated software engineer- ing, 2005, pp. 273–282

work page 2005
[17]

A model for spectra- based software diagnosis,

L. Naish, H. J. Lee, and K. Ramamohanarao, “A model for spectra- based software diagnosis,” ACM Transactions on software engineering and methodology (TOSEM) , vol. 20, no. 3, pp. 1–32, 2011

work page 2011
[18]

Localizing multiple faults in simulink models,

B. Liu, Lucia, S. Nejati, L. C. Briand, and T. Bruckmann, “Localizing multiple faults in simulink models,” in IEEE 23rd International Con- ference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016 - Volume 1 . IEEE Computer Society, 2016, pp. 146–156

work page 2016
[19]

Monitoring temporal properties of con- tinuous signals,

O. Maler and D. Nickovic, “Monitoring temporal properties of con- tinuous signals,” in International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems. Springer, 2004, pp. 152–166

work page 2004
[20]

[Online]

(Accessed: September 2025) Lockheed martin. [Online]. Available: https://www.lockheedmartin.com

work page 2025
[21]

Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,

C. Menghi, S. Nejati, K. Gaaloul, and L. C. Briand, “Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,” in Proceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering , 2019, pp. 27–38

work page 2019
[22]

[Online]

(Accessed: September 2025) Cruise control test generation. [Online]. Available: https://www.mathworks.com/help/sldv/ug/cruise-control-tes t-generation.html

work page 2025
[23]

[Online]

(Accessed: September 2025) Building a clutch lock-up model. [Online]. Available: https://www.mathworks.com/help/simulink/slref/ building-a-clutch-lock-up-model.html

work page 2025
[24]

[Online]

(Accessed: September 2025) Design a guidance system in matlab and simulink. [Online]. Available: https://www.mathworks.com/help/simul ink/slref/designing-a-guidance-system-in-matlab-and-simulink.html

work page 2025
[25]

[Online]

(Accessed: September 2025) Dc motor model simulink model. [Online]. Available: https://www.mathworks.com/matlabcentral/fileexc hange/11587-dc-motor-model-simulink

work page 2025
[26]

Arch- comp 2024 category report: Falsification,

T. Khandait, F. Formica, P. Arcaini, S. Chotaliya, G. Fainekos, A. Hekal, A. Kundu, E. Lew, M. Loreti, C. Menghi et al. , “Arch- comp 2024 category report: Falsification,” in Proceedings of the 11th Int. Workshop on Applied , vol. 103, 2024, pp. 122–144

work page 2024
[27]

[Online]

(Accessed: September 2025) Replication package for the article. [Online]. Available: https://doi.org/10.5281/zenodo.16912908

work page doi:10.5281/zenodo.16912908 2025
[28]

The oracle problem in software testing: A survey,

E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,”Transactions on Software Engineering, vol. 41, no. 5, pp. 507–525, 2015

work page 2015
[29]

Luke, Essentials of Metaheuristics , 2nd ed

S. Luke, Essentials of Metaheuristics , 2nd ed. Lulu, 2013, available for free at http://cs.gmu.edu/ ∼sean/book/metaheuristics/

work page 2013
[30]

Test generation strategies for building failure models and explaining spurious failures,

B. A. Jodat, A. Chandar, S. Nejati, and M. Sabetzadeh, “Test generation strategies for building failure models and explaining spurious failures,” ACM Transactions on Software Engineering and Methodology, vol. 33, no. 4, pp. 1–32, 2024

work page 2024
[31]

Combining genetic programming and model checking to generate en- vironment assumptions,

K. Gaaloul, C. Menghi, S. Nejati, L. C. Briand, and Y . I. Parache, “Combining genetic programming and model checking to generate en- vironment assumptions,” IEEE Transactions on Software Engineering , vol. 48, no. 9, pp. 3664–3685, 2021

work page 2021
[32]

Using genetic programming to build self-adaptivity into software-defined networks,

J. Li, S. Nejati, and M. Sabetzadeh, “Using genetic programming to build self-adaptivity into software-defined networks,” ACM Transac- tions on Autonomous and Adaptive Systems , vol. 19, no. 1, pp. 1–35, 2024

work page 2024
[33]

Structure-based constants in genetic programming,

C. B. Veenhuis, “Structure-based constants in genetic programming,” in Progress in Artificial Intelligence: 16th Portuguese Conference on Artificial Intelligence, EPIA 2013, Angra do Heroísmo, Azores, Portugal, September 9-12, 2013. Proceedings 16 . Springer, 2013, pp. 126–137

work page 2013
[34]

Harman, P

M. Harman, P. McMinn, J. T. de Souza, and S. Yoo, Search Based Software Engineering: Techniques, Taxonomy, Tutorial. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 1–59, ISBN: 978-3-642-25231-0. [Online]. Available: https://doi.org/10.1007/978-3 -642-25231-0_1

work page doi:10.1007/978-3 2012
[35]

R. Poli, W. B. Langdon, and N. F. McPhee, A Field Guide to Genetic Programming. Lulu.com, 2008, ISBN: 978-1-4092-0073-4

work page 2008
[36]

Evaluation of measures for statistical fault localisation and an optimising scheme,

D. Landsberg, H. Chockler, D. Kroening, and M. Lewis, “Evaluation of measures for statistical fault localisation and an optimising scheme,” in Fundamental Approaches to Software Engineering: 18th International Conference, FASE 2015, Held as Part of the European Joint Confer- ences on Theory and Practice of Software, ETAPS 2015, London, UK, April 11-18, 20...

work page 2015
[37]

Molnar, Interpretable machine learning

C. Molnar, Interpretable machine learning . Lulu. com, 2020, ISBN: 979-8411463330

work page 2020
[38]

Z3: An efficient smt solver,

L. De Moura and N. Bjørner, “Z3: An efficient smt solver,” in International conference on Tools and Algorithms for the Construction and Analysis of Systems . Springer, 2008, pp. 337–340

work page 2008
[39]

Requirements-driven test generation for autonomous vehicles with machine learning components,

C. E. Tuncali, G. Fainekos, D. Prokhorov, H. Ito, and J. Kapinski, “Requirements-driven test generation for autonomous vehicles with machine learning components,” IEEE Transactions on Intelligent Vehi- cles, vol. 5, no. 2, pp. 265–280, 2019

work page 2019
[40]

Pareto efficient multi-objective black-box test case selection for simulation-based testing,

A. Arrieta, S. Wang, U. Markiegi, A. Arruabarrena, L. Etxeberria, and G. Sagardui, “Pareto efficient multi-objective black-box test case selection for simulation-based testing,” Information and Software Tech- nology, 2019

work page 2019
[41]

Mining assumptions for software components using machine learning,

K. Gaaloul, C. Menghi, S. Nejati, L. C. Briand, and D. Wolfe, “Mining assumptions for software components using machine learning,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Soft- ware Engineering, 2020, pp. 159–171

work page 2020
[42]

Arch-comp 2019 category report: Falsification

G. Ernst, P. Arcaini, A. Donze, G. Fainekos, L. Mathesen, G. Pedrielli, S. Yaghoubi, Y . Yamagata, and Z. Zhang, “Arch-comp 2019 category report: Falsification.” in ARCH@ CPSIoTWeek, 2019, pp. 129–140

work page 2019
[43]

Learning non- robustness using simulation-based testing: a network traffic-shaping case study,

B. A. Jodat, S. Nejati, M. Sabetzadeh, and P. Saavedra, “Learning non- robustness using simulation-based testing: a network traffic-shaping case study,” in 2023 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2023, pp. 386–397

work page 2023
[44]

D. K. Chaturvedi, Modeling and simulation of systems using MATLAB® and Simulink®. CRC press, 2017, ISBN: 978-1439806722
[45]

(Accessed: September 2025) Navigating the do-178c certification process for airborne software. [Online]. Available: https://thecloudstrap.com/navigating-the-do-178c-certification-process/
[46]

(Accessed: September 2025) Autopilot online benchmark. [Online]. Available: https://www.mathworks.com/matlabcentral/fileexchange/41490-autopilot-demo-for-arp4754a-do-178c-and-do-331
[47]

(Accessed: September 2025) Beamng.tech. [Online]. Available: https://beamng.tech
[48]

Control strategies for autonomous vehicles,

C. V. Samak, T. V. Samak, and S. Kandhasamy, “Control strategies for autonomous vehicles,” in Autonomous driving and advanced driver-assistance systems (ADAS). CRC Press, 2021, pp. 37–86
[49]

End to End Learning for Self-Driving Cars

M. Bojarski, D. W. del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self-driving cars,” ArXiv, vol. abs/1604.07316, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:15780954
[50]

(Accessed: September 2025) Github repo for cyber-physical systems testing tool competition. [Online]. Available: https://github.com/sbft-cps-tool-competition/cps-tool-competition
[51]

Evaluating the impact of flaky simulators on testing autonomous driving systems,

M. H. Amini, S. Naseri, and S. Nejati, “Evaluating the impact of flaky simulators on testing autonomous driving systems,” Empirical Software Engineering, vol. 29, no. 2, pp. 1–30, 2024

[52]

Digital twins are not monozygotic - cross-replicating ADAS testing in two industry-grade automotive simulators,

M. Borg, R. B. Abdessalem, S. Nejati, F. Jegeden, and D. Shin, “Digital twins are not monozygotic - cross-replicating ADAS testing in two industry-grade automotive simulators,” in 14th IEEE Conference on Software Testing, Verification and Validation, ICST 2021, Porto de Galinhas, Brazil, April 12-16, 2021. IEEE, 2021, pp. 383–393
[53]

Practical bayesian optimization of machine learning algorithms,

J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” Advances in neural information processing systems, vol. 25, 2012.
[54]

A comparison of bloat control methods for genetic programming,

S. Luke and L. Panait, “A comparison of bloat control methods for genetic programming,” Evolutionary computation, vol. 14, no. 3, pp. 309–344, 2006
[55]

On a test of whether one of two random variables is stochastically larger than the other,

H. B. Mann and D. R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other,” The annals of mathematical statistics, pp. 50–60, 1947

[56]

A critique and improvement of the cl common language effect size statistics of mcgraw and wong,

A. Vargha and H. D. Delaney, “A critique and improvement of the cl common language effect size statistics of mcgraw and wong,” Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000
[57]

Controlling the false discovery rate: a practical and powerful approach to multiple testing,

Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal statistical society: series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995
[58]

Simulation-based test case generation for unmanned aerial vehicles in the neighborhood of real flights,

S. Khatiri, S. Panichella, and P. Tonella, “Simulation-based test case generation for unmanned aerial vehicles in the neighborhood of real flights,” in 2023 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2023, pp. 281–292
[59]

A practical guide for using statistical tests to assess randomized algorithms in software engineering,

A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” in Proceedings of the 33rd international conference on software engineering, 2011, pp. 1–10

[60]

Automated formalization of structured natural language requirements,

D. Giannakopoulou, T. Pressburger, A. Mavridou, and J. Schumann, “Automated formalization of structured natural language requirements,” Information and Software Technology, vol. 137, p. 106590, 2021
[61]

Evaluating model testing and model checking for finding requirements violations in simulink models,

S. Nejati, K. Gaaloul, C. Menghi, L. C. Briand, S. Foster, and D. Wolfe, “Evaluating model testing and model checking for finding requirements violations in simulink models,” in Proceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering, 2019, pp. 1015–1025
[62]

(Accessed: September 2025) End-to-end deep learning for self-driving cars. [Online]. Available: https://developer.nvidia.com/blog/deep-learning-self-driving-cars/
[63]

The daikon system for dynamic detection of likely invariants,

M. D. Ernst, J. H. Perkins, P. J. Guo, S. McCamant, C. Pacheco, M. S. Tschantz, and C. Xiao, “The daikon system for dynamic detection of likely invariants,” Science of computer programming, vol. 69, no. 1-3, pp. 35–45, 2007
[64]

Test oracle assessment and improvement,

G. Jahangirova, D. Clark, M. Harman, and P. Tonella, “Test oracle assessment and improvement,” in Proceedings of the 25th international symposium on software testing and analysis, 2016, pp. 247–258
[65]

Evolutionary improvement of assertion oracles,

V. Terragni, G. Jahangirova, P. Tonella, and M. Pezzè, “Evolutionary improvement of assertion oracles,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1178–1189
[66]

Using semi-supervised learning for predicting metamorphic relations,

B. Hardin and U. Kanewala, “Using semi-supervised learning for predicting metamorphic relations,” in Proceedings of the 3rd International Workshop on Metamorphic Testing, 2018, pp. 14–17
[67]

Using machine learning techniques to detect metamorphic relations for programs without test oracles,

U. Kanewala and J. M. Bieman, “Using machine learning techniques to detect metamorphic relations for programs without test oracles,” in 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2013, pp. 1–10

[68]

Predicting metamorphic relations for testing scientific software: a machine learning approach using graph kernels,

U. Kanewala, J. M. Bieman, and A. Ben-Hur, “Predicting metamorphic relations for testing scientific software: a machine learning approach using graph kernels,” Software testing, verification and reliability, vol. 26, no. 3, pp. 245–269, 2016
[69]

Leveraging mutants for automatic prediction of metamorphic relations using machine learning,

A. Nair, K. Meinke, and S. Eldh, “Leveraging mutants for automatic prediction of metamorphic relations using machine learning,” in Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation, 2019, pp. 1–6
[70]

Rbf-mlmr: A multi-label metamorphic relation prediction approach using rbf neural network,

P. Zhang, X. Zhou, P. Pelliccione, and H. Leung, “Rbf-mlmr: A multi-label metamorphic relation prediction approach using rbf neural network,” IEEE access, vol. 5, pp. 21 791–21 805, 2017

[71]

Genmorph: Automatically generating metamorphic relations via genetic programming,

J. Ayerdi, V. Terragni, G. Jahangirova, A. Arrieta, and P. Tonella, “Genmorph: Automatically generating metamorphic relations via genetic programming,” IEEE Transactions on Software Engineering, 2024
[72]

Exploratory test oracle using multi-layer perceptron neural network,

W. Makondo, R. Nallanthighal, I. Mapanga, and P. Kadebu, “Exploratory test oracle using multi-layer perceptron neural network,” in 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 2016, pp. 1166–1171
[73]

An automated oracle approach to test decision-making structures,

S. R. Shahamiri, W. M. N. W. Kadir, and S. bin Ibrahim, “An automated oracle approach to test decision-making structures,” in 2010 3rd International Conference on Computer Science and Information Technology, vol. 5. IEEE, 2010, pp. 30–34

[74]

A neural net based approach to test oracle,

K. Aggarwal, Y. Singh, A. Kaur, and O. Sangwan, “A neural net based approach to test oracle,” ACM SIGSOFT Software Engineering Notes, vol. 29, no. 3, pp. 1–6, 2004
[75]

Artificial neural network for automatic test oracles generation,

H. Jin, Y. Wang, N.-W. Chen, Z.-J. Gou, and S. Wang, “Artificial neural network for automatic test oracles generation,” in 2008 International Conference on Computer Science and Software Engineering, vol. 2. IEEE, 2008, pp. 727–730
[76]

Performing software test oracle based on deep neural network with fuzzy inference system,

A. K. Monsefi, B. Zakeri, S. Samsam, and M. Khashehchi, “Performing software test oracle based on deep neural network with fuzzy inference system,” in High-Performance Computing and Big Data Analysis: Second International Congress, TopHPC 2019, Tehran, Iran, April 23–25, 2019, Revised Selected Papers 2. Springer, 2019, pp. 406–417
[77]

Radial basis function neural network based approach to test oracle,

O. P. Sangwan, P. K. Bhatia, and Y. Singh, “Radial basis function neural network based approach to test oracle,” ACM SIGSOFT Software Engineering Notes, vol. 36, no. 5, pp. 1–5, 2011
[78]

Automated test oracle based on neural networks,

M. Ye, B. Feng, L. Zhu, and Y. Lin, “Automated test oracle based on neural networks,” in 2006 5th IEEE International Conference on Cognitive Informatics, vol. 1. IEEE, 2006, pp. 517–522
[79]

Automatic test oracle based on probabilistic neural networks,

R. Zhang, Y.-w. Wang, and M.-z. Zhang, “Automatic test oracle based on probabilistic neural networks,” in Recent Developments in Intelligent Computing, Communication and Devices: Proceedings of ICCD 2017. Springer, 2019, pp. 437–445
[80]

Generating metamorphic relations for cyber-physical systems with genetic programming: an industrial case study,

J. Ayerdi, V. Terragni, A. Arrieta, P. Tonella, G. Sagardui, and M. Arratibel, “Generating metamorphic relations for cyber-physical systems with genetic programming: an industrial case study,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1264–1274

Showing first 80 references.

[1] [1]

[Online]

(Accessed: September 2025) Raquel urtasun’s tech company develops self-driving vehicle simulator. [Online]. Available: https://www.thestar. com/business/raquel-urtasun-s-tech-company-develops-self-driving-v ehicle-simulator/article_4fc552f3-cbec-523c-ad3a-ec6aa93cdad7.html

work page 2025

[2] [2]

Machine learning-based test selection for simulation-based testing of self-driving cars software,

C. Birchler, S. Khatiri, B. Bosshard, A. Gambi, and S. Panichella, “Machine learning-based test selection for simulation-based testing of self-driving cars software,” Empirical Software Engineering , vol. 28, no. 3, p. 71, 2023

work page 2023

[3] [3]

Salvo: Automated generation of diversified tests for self-driving cars from existing maps,

V . Nguyen, S. Huber, and A. Gambi, “Salvo: Automated generation of diversified tests for self-driving cars from existing maps,” in2021 IEEE International Conference on Artificial Intelligence Testing (AITest) . IEEE, 2021, pp. 128–135

work page 2021

[4] [4]

Simulation-based testing of unmanned aerial vehicles with aerialist,

S. Khatiri, S. Panichella, and P. Tonella, “Simulation-based testing of unmanned aerial vehicles with aerialist,” in Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, 2024, pp. 134–138

work page 2024

[5] [5]

An empirical analysis of flaky tests,

Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, “An empirical analysis of flaky tests,” inProceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, 2014, pp. 643–653

work page 2014

[6] [6]

A survey of flaky tests,

O. Parry, G. M. Kapfhammer, M. Hilton, and P. McMinn, “A survey of flaky tests,” ACM Transactions on Software Engineering and Method- ology (TOSEM), vol. 31, no. 1, pp. 1–74, 2021

work page 2021

[7] [7]

Constructing automated test oracle for low observable software,

M. Valueian, N. Attar, H. Haghighi, and M. Vahidi-Asl, “Constructing automated test oracle for low observable software,” Scientia Iranica , vol. 27, no. 3, pp. 1333–1351, 2020

work page 2020

[8] [8]

Using a neural network in the software testing process,

M. Vanmali, M. Last, and A. Kandel, “Using a neural network in the software testing process,” International Journal of Intelligent Systems , vol. 17, no. 1, pp. 45–62, 2002

work page 2002

[9] [9]

An automated framework for software test oracle,

S. R. Shahamiri, W. M. N. W. Kadir, S. Ibrahim, and S. Z. M. Hashim, “An automated framework for software test oracle,” Information and Software Technology, vol. 53, no. 7, pp. 774–788, 2011. 23

work page 2011

[10] [10]

Artificial neural networks as multi-networks automated test oracle,

S. R. Shahamiri, W. M. Wan-Kadir, S. Ibrahim, and S. Z. M. Hashim, “Artificial neural networks as multi-networks automated test oracle,” Automated Software Engineering , vol. 19, pp. 303–334, 2012

work page 2012

[11] [11]

An approach to design test oracle for aspect oriented software systems using soft computing approach,

A. Singhal, A. Bansal, and A. Kumar, “An approach to design test oracle for aspect oriented software systems using soft computing approach,” International Journal of System Assurance Engineering and Management, vol. 7, pp. 1–5, 2016

work page 2016

[12] [12]

A classifier-based test oracle for embedded software,

F. Gholami, N. Attar, H. Haghighi, M. V . Asl, M. Valueian, and S. Mo- hamadyari, “A classifier-based test oracle for embedded software,” in 2018 Real-Time and Embedded Systems and Technologies (RTEST) . IEEE, 2018, pp. 104–111

work page 2018

[13] [13]

A machine learning approach to generate test oracles,

R. Braga, P. S. Neto, R. Rabêlo, J. Santiago, and M. Souza, “A machine learning approach to generate test oracles,” in Proceedings of the XXXII Brazilian Symposium on Software Engineering , 2018, pp. 142–151

work page 2018

[14] [14]

Human-in-the-loop automatic program repair,

C. Geethal, M. Böhme, and V .-T. Pham, “Human-in-the-loop automatic program repair,” IEEE Transactions on Software Engineering , 2023

work page 2023

[15] [15]

On the accuracy of spectrum-based fault localization,

R. Abreu, P. Zoeteweij, and A. J. Van Gemund, “On the accuracy of spectrum-based fault localization,” in Testing: Academic and industrial conference practice and research techniques-MUTATION (TAICPART- MUTATION 2007). IEEE, 2007, pp. 89–98

work page 2007

[16] [16]

Empirical evaluation of the tarantula automatic fault-localization technique,

J. A. Jones and M. J. Harrold, “Empirical evaluation of the tarantula automatic fault-localization technique,” in Proceedings of the 20th IEEE/ACM international Conference on Automated software engineer- ing, 2005, pp. 273–282

work page 2005

[17] [17]

A model for spectra- based software diagnosis,

L. Naish, H. J. Lee, and K. Ramamohanarao, “A model for spectra- based software diagnosis,” ACM Transactions on software engineering and methodology (TOSEM) , vol. 20, no. 3, pp. 1–32, 2011

work page 2011

[18] [18]

Localizing multiple faults in simulink models,

B. Liu, Lucia, S. Nejati, L. C. Briand, and T. Bruckmann, “Localizing multiple faults in simulink models,” in IEEE 23rd International Con- ference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016 - Volume 1 . IEEE Computer Society, 2016, pp. 146–156

work page 2016

[19] [19]

Monitoring temporal properties of con- tinuous signals,

O. Maler and D. Nickovic, “Monitoring temporal properties of con- tinuous signals,” in International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems. Springer, 2004, pp. 152–166

work page 2004

[20] [20]

[Online]

(Accessed: September 2025) Lockheed martin. [Online]. Available: https://www.lockheedmartin.com

work page 2025

[21] [21]

Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,

C. Menghi, S. Nejati, K. Gaaloul, and L. C. Briand, “Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,” in Proceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering , 2019, pp. 27–38

work page 2019

[22] [22]

[Online]

(Accessed: September 2025) Cruise control test generation. [Online]. Available: https://www.mathworks.com/help/sldv/ug/cruise-control-tes t-generation.html

work page 2025

[23] [23]

[Online]

(Accessed: September 2025) Building a clutch lock-up model. [Online]. Available: https://www.mathworks.com/help/simulink/slref/ building-a-clutch-lock-up-model.html

work page 2025

[24] [24]

[Online]

(Accessed: September 2025) Design a guidance system in matlab and simulink. [Online]. Available: https://www.mathworks.com/help/simul ink/slref/designing-a-guidance-system-in-matlab-and-simulink.html

work page 2025

[25] [25]

[Online]

(Accessed: September 2025) Dc motor model simulink model. [Online]. Available: https://www.mathworks.com/matlabcentral/fileexc hange/11587-dc-motor-model-simulink

work page 2025

[26] [26]

Arch- comp 2024 category report: Falsification,

T. Khandait, F. Formica, P. Arcaini, S. Chotaliya, G. Fainekos, A. Hekal, A. Kundu, E. Lew, M. Loreti, C. Menghi et al. , “Arch- comp 2024 category report: Falsification,” in Proceedings of the 11th Int. Workshop on Applied , vol. 103, 2024, pp. 122–144

work page 2024

[27] [27]

[Online]

(Accessed: September 2025) Replication package for the article. [Online]. Available: https://doi.org/10.5281/zenodo.16912908

work page doi:10.5281/zenodo.16912908 2025

[28] [28]

The oracle problem in software testing: A survey,

E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,”Transactions on Software Engineering, vol. 41, no. 5, pp. 507–525, 2015

work page 2015

[29] [29]

Luke, Essentials of Metaheuristics , 2nd ed

S. Luke, Essentials of Metaheuristics , 2nd ed. Lulu, 2013, available for free at http://cs.gmu.edu/ ∼sean/book/metaheuristics/

work page 2013

[30] [30]

Test generation strategies for building failure models and explaining spurious failures,

B. A. Jodat, A. Chandar, S. Nejati, and M. Sabetzadeh, “Test generation strategies for building failure models and explaining spurious failures,” ACM Transactions on Software Engineering and Methodology, vol. 33, no. 4, pp. 1–32, 2024

work page 2024

[31] [31]

Combining genetic programming and model checking to generate en- vironment assumptions,

K. Gaaloul, C. Menghi, S. Nejati, L. C. Briand, and Y . I. Parache, “Combining genetic programming and model checking to generate en- vironment assumptions,” IEEE Transactions on Software Engineering , vol. 48, no. 9, pp. 3664–3685, 2021

work page 2021

[32] [32]

Using genetic programming to build self-adaptivity into software-defined networks,

J. Li, S. Nejati, and M. Sabetzadeh, “Using genetic programming to build self-adaptivity into software-defined networks,” ACM Transac- tions on Autonomous and Adaptive Systems , vol. 19, no. 1, pp. 1–35, 2024

work page 2024

[33] [33]

Structure-based constants in genetic programming,

C. B. Veenhuis, “Structure-based constants in genetic programming,” in Progress in Artificial Intelligence: 16th Portuguese Conference on Artificial Intelligence, EPIA 2013, Angra do Heroísmo, Azores, Portugal, September 9-12, 2013. Proceedings 16 . Springer, 2013, pp. 126–137

work page 2013

[34] [34]

Harman, P

M. Harman, P. McMinn, J. T. de Souza, and S. Yoo, Search Based Software Engineering: Techniques, Taxonomy, Tutorial. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 1–59, ISBN: 978-3-642-25231-0. [Online]. Available: https://doi.org/10.1007/978-3 -642-25231-0_1

work page doi:10.1007/978-3 2012

[35] [35]

R. Poli, W. B. Langdon, and N. F. McPhee, A Field Guide to Genetic Programming. Lulu.com, 2008, ISBN: 978-1-4092-0073-4

work page 2008

[36] [36]

Evaluation of measures for statistical fault localisation and an optimising scheme,

D. Landsberg, H. Chockler, D. Kroening, and M. Lewis, “Evaluation of measures for statistical fault localisation and an optimising scheme,” in Fundamental Approaches to Software Engineering: 18th International Conference, FASE 2015, Held as Part of the European Joint Confer- ences on Theory and Practice of Software, ETAPS 2015, London, UK, April 11-18, 20...

work page 2015

[37] [37]

Molnar, Interpretable machine learning

C. Molnar, Interpretable machine learning . Lulu. com, 2020, ISBN: 979-8411463330

work page 2020

[38] [38]

Z3: An efficient smt solver,

L. De Moura and N. Bjørner, “Z3: An efficient smt solver,” in International conference on Tools and Algorithms for the Construction and Analysis of Systems . Springer, 2008, pp. 337–340

work page 2008

[39] [39]

Requirements-driven test generation for autonomous vehicles with machine learning components,

C. E. Tuncali, G. Fainekos, D. Prokhorov, H. Ito, and J. Kapinski, “Requirements-driven test generation for autonomous vehicles with machine learning components,” IEEE Transactions on Intelligent Vehi- cles, vol. 5, no. 2, pp. 265–280, 2019

work page 2019

[40] [40]

Pareto efficient multi-objective black-box test case selection for simulation-based testing,

A. Arrieta, S. Wang, U. Markiegi, A. Arruabarrena, L. Etxeberria, and G. Sagardui, “Pareto efficient multi-objective black-box test case selection for simulation-based testing,” Information and Software Tech- nology, 2019

work page 2019

[41] [41]

Mining assumptions for software components using machine learning,

K. Gaaloul, C. Menghi, S. Nejati, L. C. Briand, and D. Wolfe, “Mining assumptions for software components using machine learning,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Soft- ware Engineering, 2020, pp. 159–171

work page 2020

[42] [42]

Arch-comp 2019 category report: Falsification

G. Ernst, P. Arcaini, A. Donze, G. Fainekos, L. Mathesen, G. Pedrielli, S. Yaghoubi, Y . Yamagata, and Z. Zhang, “Arch-comp 2019 category report: Falsification.” in ARCH@ CPSIoTWeek, 2019, pp. 129–140

work page 2019

[43] [43]

Learning non- robustness using simulation-based testing: a network traffic-shaping case study,

B. A. Jodat, S. Nejati, M. Sabetzadeh, and P. Saavedra, “Learning non- robustness using simulation-based testing: a network traffic-shaping case study,” in 2023 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2023, pp. 386–397

work page 2023

[44] [44]

D. K. Chaturvedi, Modeling and simulation of systems using MAT- LAB® and Simulink® . CRC press, 2017, ISBN: 978-1439806722

work page 2017

[45] [45]

[Online]

(Accessed: September 2025) Navigating the do-178c certification process for airborne software. [Online]. Available: https://thecloudstra p.com/navigating-the-do-178c-certification-process/

work page 2025

[46] [46]

[Online]

(Accessed: September 2025) Autopilot online benchmark. [Online]. Available: https://www.mathworks.com/matlabcentral/fileexchange/41 490-autopilot-demo-for-arp4754a-do-178c-and-do-331

work page 2025

[47] [47]

[Online]

(Accessed: September 2025) Beamng.tech. [Online]. Available: https://beamng.tech

work page 2025

[48] [48]

Control strategies for autonomous vehicles,

C. V . Samak, T. V . Samak, and S. Kandhasamy, “Control strategies for autonomous vehicles,” in Autonomous driving and advanced driver- assistance systems (ADAS) . CRC Press, 2021, pp. 37–86

work page 2021

[49] [49]

End to End Learning for Self-Driving Cars

M. Bojarski, D. W. del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self- driving cars,” ArXiv, vol. abs/1604.07316, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:15780954

work page internal anchor Pith review Pith/arXiv arXiv 2016

[50] [50]

[Online]

(Accessed: September 2025) Github repo for cyber-physical systems testing tool competition. [Online]. Available: https://github.com/sbft-c ps-tool-competition/cps-tool-competition

work page 2025

[51] [51]

Evaluating the impact of flaky simulators on testing autonomous driving systems,

M. H. Amini, S. Naseri, and S. Nejati, “Evaluating the impact of flaky simulators on testing autonomous driving systems,” Empirical Software Engineering, vol. 29, no. 2, pp. 1–30, 2024

work page 2024

[52] [52]

Digital twins are not monozygotic - cross-replicating ADAS testing in two industry-grade automotive simulators,

M. Borg, R. B. Abdessalem, S. Nejati, F. Jegeden, and D. Shin, “Digital twins are not monozygotic - cross-replicating ADAS testing in two industry-grade automotive simulators,” in 14th IEEE Conference on Software Testing, Verification and Validation, ICST 2021, Porto de Galinhas, Brazil, April 12-16, 2021 . IEEE, 2021, pp. 383–393

work page 2021

[53] [53]

Practical bayesian optimiza- tion of machine learning algorithms,

J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimiza- tion of machine learning algorithms,” Advances in neural information processing systems, vol. 25, 2012. 24

work page 2012

[54] [54]

A comparison of bloat control methods for genetic programming,

S. Luke and L. Panait, “A comparison of bloat control methods for genetic programming,” Evolutionary computation , vol. 14, no. 3, pp. 309–344, 2006

work page 2006

[55] [55]

On a test of whether one of two random variables is stochastically larger than the other,

H. B. Mann and D. R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other,” The annals of mathematical statistics, pp. 50–60, 1947

work page 1947

[56] [56]

A critique and improvement of the cl common language effect size statistics of mcgraw and wong,

A. Vargha and H. D. Delaney, “A critique and improvement of the cl common language effect size statistics of mcgraw and wong,” Journal of Educational and Behavioral Statistics , vol. 25, no. 2, pp. 101–132, 2000

work page 2000

[57] [57]

Controlling the false discovery rate: a practical and powerful approach to multiple testing,

Y . Benjamini and Y . Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal statistical society: series B (Methodological) , vol. 57, no. 1, pp. 289–300, 1995

work page 1995

[58] [58]

Simulation-based test case generation for unmanned aerial vehicles in the neighborhood of real flights,

S. Khatiri, S. Panichella, and P. Tonella, “Simulation-based test case generation for unmanned aerial vehicles in the neighborhood of real flights,” in2023 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2023, pp. 281–292

work page 2023

[59] [59]

A practical guide for using statistical tests to assess randomized algorithms in software engineering,

A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” in Proceedings of the 33rd international conference on software engineering, 2011, pp. 1–10

work page 2011

[60] [60]

Automated formalization of structured natural language requirements,

D. Giannakopoulou, T. Pressburger, A. Mavridou, and J. Schumann, “Automated formalization of structured natural language requirements,” Information and Software Technology , vol. 137, p. 106590, 2021

work page 2021

[61] [61]

Evaluating model testing and model checking for finding requirements violations in simulink models,

S. Nejati, K. Gaaloul, C. Menghi, L. C. Briand, S. Foster, and D. Wolfe, “Evaluating model testing and model checking for finding requirements violations in simulink models,” in Proceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering, 2019, pp. 1015– 1025

work page 2019

[62] [62]

[Online]

(Accessed: September 2025) End-to-end deep learning for self-driving cars. [Online]. Available: https://developer.nvidia.com/blog/deep-learn ing-self-driving-cars/

work page 2025

[63] [63]

The daikon system for dynamic detection of likely invariants,

M. D. Ernst, J. H. Perkins, P. J. Guo, S. McCamant, C. Pacheco, M. S. Tschantz, and C. Xiao, “The daikon system for dynamic detection of likely invariants,” Science of computer programming , vol. 69, no. 1-3, pp. 35–45, 2007

work page 2007

[64] [64]

Test oracle assessment and improvement,

G. Jahangirova, D. Clark, M. Harman, and P. Tonella, “Test oracle assessment and improvement,” in Proceedings of the 25th international symposium on software testing and analysis , 2016, pp. 247–258

work page 2016

[65] [65]

Evolutionary improvement of assertion oracles,

V . Terragni, G. Jahangirova, P. Tonella, and M. Pezzè, “Evolutionary improvement of assertion oracles,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , 2020, pp. 1178–1189

work page 2020

[66] [66]

Using semi-supervised learning for pre- dicting metamorphic relations,

B. Hardin and U. Kanewala, “Using semi-supervised learning for pre- dicting metamorphic relations,” in Proceedings of the 3rd International Workshop on Metamorphic Testing, 2018, pp. 14–17

work page 2018

[67] [67]

Using machine learning techniques to detect metamorphic relations for programs without test oracles,

U. Kanewala and J. M. Bieman, “Using machine learning techniques to detect metamorphic relations for programs without test oracles,” in 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2013, pp. 1–10

work page 2013

[68] [68]

Predicting metamorphic relations for testing scientific software: a machine learning approach using graph kernels,

U. Kanewala, J. M. Bieman, and A. Ben-Hur, “Predicting metamorphic relations for testing scientific software: a machine learning approach using graph kernels,” Software testing, verification and reliability , vol. 26, no. 3, pp. 245–269, 2016

work page 2016

[69] [69]

Leveraging mutants for automatic prediction of metamorphic relations using machine learning,

A. Nair, K. Meinke, and S. Eldh, “Leveraging mutants for automatic prediction of metamorphic relations using machine learning,” in Pro- ceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation , 2019, pp. 1–6

work page 2019

[70] P. Zhang, X. Zhou, P. Pelliccione, and H. Leung, “Rbf-mlmr: A multi-label metamorphic relation prediction approach using rbf neural network,” IEEE access, vol. 5, pp. 21 791–21 805, 2017

[71] J. Ayerdi, V. Terragni, G. Jahangirova, A. Arrieta, and P. Tonella, “Genmorph: Automatically generating metamorphic relations via genetic programming,” IEEE Transactions on Software Engineering, 2024

[72] W. Makondo, R. Nallanthighal, I. Mapanga, and P. Kadebu, “Exploratory test oracle using multi-layer perceptron neural network,” in 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 2016, pp. 1166–1171

[73] S. R. Shahamiri, W. M. N. W. Kadir, and S. bin Ibrahim, “An automated oracle approach to test decision-making structures,” in 2010 3rd International Conference on Computer Science and Information Technology, vol. 5. IEEE, 2010, pp. 30–34

[74] K. Aggarwal, Y. Singh, A. Kaur, and O. Sangwan, “A neural net based approach to test oracle,” ACM SIGSOFT Software Engineering Notes, vol. 29, no. 3, pp. 1–6, 2004

[75] H. Jin, Y. Wang, N.-W. Chen, Z.-J. Gou, and S. Wang, “Artificial neural network for automatic test oracles generation,” in 2008 International Conference on Computer Science and Software Engineering, vol. 2. IEEE, 2008, pp. 727–730

[76] A. K. Monsefi, B. Zakeri, S. Samsam, and M. Khashehchi, “Performing software test oracle based on deep neural network with fuzzy inference system,” in High-Performance Computing and Big Data Analysis: Second International Congress, TopHPC 2019, Tehran, Iran, April 23–25, 2019, Revised Selected Papers 2. Springer, 2019, pp. 406–417

[77] O. P. Sangwan, P. K. Bhatia, and Y. Singh, “Radial basis function neural network based approach to test oracle,” ACM SIGSOFT Software Engineering Notes, vol. 36, no. 5, pp. 1–5, 2011

[78] M. Ye, B. Feng, L. Zhu, and Y. Lin, “Automated test oracle based on neural networks,” in 2006 5th IEEE International Conference on Cognitive Informatics, vol. 1. IEEE, 2006, pp. 517–522

[79] R. Zhang, Y.-w. Wang, and M.-z. Zhang, “Automatic test oracle based on probabilistic neural networks,” in Recent Developments in Intelligent Computing, Communication and Devices: Proceedings of ICCD 2017. Springer, 2019, pp. 437–445

[80] J. Ayerdi, V. Terragni, A. Arrieta, P. Tonella, G. Sagardui, and M. Arratibel, “Generating metamorphic relations for cyber-physical systems with genetic programming: an industrial case study,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1264–1274