arxiv: 2508.20902 · v3 · submitted 2025-08-28 · 💻 cs.SE

Automated Test Validators for Flaky Cyber-Physical System Simulators: Approach and Evaluation

Pith reviewed 2026-05-18 20:51 UTC · model grok-4.3

classification 💻 cs.SE
keywords cyber-physical systems · test validators · genetic programming · spectrum-based fault localization · flaky simulators · simulation-based testing · Ochiai formula · precondition violations

The pith

Genetic programming using the Ochiai formula produces more accurate test validators for filtering ineffective inputs in flaky cyber-physical system simulators than genetic programming with Tarantula or Naish, decision trees, or decision rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops automatic test validators to skip ineffective inputs before running costly CPS simulators that may produce inconsistent results due to flakiness. Two generation methods are compared: genetic programming that treats spectrum-based fault localization formulas as fitness functions, and decision trees or decision rules. The validators target precondition violations, operational design domain limit violations, and inherently safe scenarios. Across aerospace, networking, and autonomous driving case studies, the genetic programming variant with Ochiai shows higher accuracy than the alternatives, and this edge holds even after accounting for simulator flakiness. The validators are also shown to be robust, with low accuracy variation, and most of their assertions match requirements drawn from standards and empirical literature.

Core claim

Test validators generated using genetic programming with the Ochiai spectrum-based fault localization formula are significantly more accurate than those generated using genetic programming with Tarantula or Naish, or using decision trees and decision rules. This accuracy advantage remains even when accounting for the flakiness of the simulator. The Ochiai-based validators are robust against flakiness, showing only 4 percent average variation in accuracy across four network and autonomous-driving systems with flaky behaviours. On average, 88.7 percent of the assertions inferred by the approach align or overlap with precondition violations, ODD-limit violations, and nominal safe conditions extracted from technical standards and empirical results in the literature.

What carries the argument

Genetic programming that uses spectrum-based fault localization ranking formulas, especially Ochiai, as fitness functions to evolve boolean expressions classifying test inputs as valid or invalid for simulator execution.
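
To make the idea of an evolved boolean expression concrete, here is a minimal sketch of how one candidate validator might be represented and evaluated on test inputs. The nested-tuple encoding, the `evaluate` helper, and the input variables `speed` and `fog` are hypothetical illustrations; the paper defines its own grammar, which is not reproduced on this page.

```python
import operator

# Hypothetical encoding of a candidate validator as nested tuples:
#   ("and", e1, e2), ("or", e1, e2), ("<", "variable", constant), ...
_COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def evaluate(expr, test_input):
    """Return True if the boolean expression holds for one test input (a dict)."""
    op = expr[0]
    if op == "and":
        return evaluate(expr[1], test_input) and evaluate(expr[2], test_input)
    if op == "or":
        return evaluate(expr[1], test_input) or evaluate(expr[2], test_input)
    if op in _COMPARATORS:
        _, variable, constant = expr
        return _COMPARATORS[op](test_input[variable], constant)
    raise ValueError(f"unknown operator: {op!r}")


# Illustrative candidate: "speed below 5 and fog density at least 0.8".
candidate = ("and", ("<", "speed", 5.0), (">=", "fog", 0.8))

# Made-up inputs, for illustration only.
inputs = [
    {"speed": 3.0, "fog": 0.9},
    {"speed": 30.0, "fog": 0.9},
    {"speed": 3.0, "fog": 0.1},
]
flags = [evaluate(candidate, x) for x in inputs]
```

A genetic programming loop would mutate and recombine such expressions and score each one with a fitness function. A tree encoding is a common choice for this, because subtrees can be swapped without producing syntactically invalid expressions.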

If this is right

  • Validators can pre-filter test inputs that violate preconditions or exceed ODD limits, avoiding unnecessary simulator runs.
  • The accuracy advantage persists despite inconsistent outcomes caused by simulator flakiness.
  • Generated assertions align closely with technical standards and empirical results from the literature.
  • Robustness is demonstrated with only 4 percent average accuracy variation across multiple flaky systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The filtering step could be inserted early in automated testing pipelines to reduce overall simulation time for large input spaces.
  • Similar repurposing of fault localization formulas might help other simulation-based domains that face high execution costs.
  • If the validators prove stable, they could enable broader sampling of critical scenarios without proportional growth in compute demand.

Load-bearing premise

Spectrum-based fault localization ranking formulas such as Ochiai can be repurposed as effective fitness functions inside genetic programming to evolve validators that correctly identify precondition violations, ODD-limit violations, and inherently safe scenarios without needing to execute the simulator.

What would settle it

A new case study on a different CPS domain where the accuracy of GP with Ochiai is not significantly higher than GP with Tarantula, Naish, or decision trees would falsify the central accuracy claim.

Figures

Figures reproduced from arXiv: 2508.20902 by Baharin A. Jodat, Khouloud Gaaloul, Mehrdad Sabetzadeh, Shiva Nejati.

Figure 1: Workflow for verdict assignment using (a) test oracles based on system execution, (b) assertion-based test oracles which …
Figure 2: An assertion-based test oracle for a simplified ADS with inputs …
Figure 4: Our approach for deriving assertion-based test oracles
Figure 5: Syntactic rules of the grammar (denoted by …
Figure 6: Fitness functions for our GP-based condition inference
Figure 7: Computing c_p(c) and c_f(c) in SBFL fitness functions in …
Figure 8: Pruning inconsistent assertions from test oracles using a …
Figure 9: (a) An example set of test inputs for a (signal-based) autopilot system, with input signals …
Figure 10: Step-by-step illustration of using rules to derive logical assertion conditions over the signals in Figure 9 from assertion …
Figure 6: Naish denoted by GPN, Tarantula denoted by GPT, and Ochiai denoted by GPO. To account for the randomness of GP, DT, and DR, we apply each technique 20 times to the training set for each case study. In addition to considering the test oracle generation methods individually, we also consider an ensemble approach. Specifically, for each run of GPN, GPT, GPO, DT, and DR, the ensemble method computes the uni…
Figure 11: Illustrations of the percentage of unique correct …
Figure 13: Pass-as-Fail rates of the test oracles generated by …
Figure 15: Average relative accuracies of the test oracles gener…
Figure 16: Trade-off between relative accuracy and inconclusive …

Original abstract

Simulation-based testing of cyber-physical systems (CPS) is costly due to the time-consuming execution of CPS simulators. In addition, CPS simulators may be flaky, leading to inconsistent test outcomes and requiring repeated test re-execution for reliable test verdicts. Many test inputs within the input space of CPS may not effectively exercise the behaviour of the system under test (SUT) -- for instance, those that violate system preconditions, exceed operational design domain (ODD) limits, or represent inherently safe scenarios. In this article, we propose to use test validators to filter out such test inputs before execution. We describe two methods for generating test validators: one using genetic programming (GP) that employs well-known spectrum-based fault localization (SBFL) ranking formulas, namely Ochiai, Tarantula, and Naish, as fitness functions; and the other using decision trees (DT) and decision rules (DR). We evaluate our test validators through case studies in the domains of aerospace, networking and autonomous driving. We show that test validators generated using GP with Ochiai are significantly more accurate than those generated using GP with Tarantula and Naish or using DT or DR. Moreover, this accuracy advantage remains even when accounting for the flakiness of the simulator. We further show that our test validators generated by GP with Ochiai are robust against flakiness with only 4% average variation in their accuracy results across four different network and autonomous-driving systems with flaky behaviours. Finally, we show that, on average, 88.7% of the assertions inferred by our approach align or overlap with requirements precondition violations, ODD-limit violations, and nominal safe conditions extracted from technical standards and empirical results in the literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes automated generation of test validators to filter ineffective inputs (precondition violations, ODD-limit violations, inherently safe scenarios) for flaky CPS simulators, thereby reducing execution costs. Two generation methods are described: genetic programming (GP) that repurposes spectrum-based fault localization (SBFL) formulas (Ochiai, Tarantula, Naish) as fitness functions, and decision trees/rules (DT/DR). Evaluation on aerospace, networking, and autonomous-driving case studies claims that GP with Ochiai yields significantly higher accuracy than the alternatives, that this advantage persists under simulator flakiness, that accuracy varies only 4% on average across four flaky systems, and that 88.7% of inferred assertions align with literature-derived requirements.

Significance. If the central empirical claims hold after rigorous statistical validation and clearer method exposition, the work could meaningfully lower the cost of simulation-based CPS testing in safety-critical domains. The explicit handling of flakiness and the reported alignment with external standards are practical strengths. The approach also offers a novel transfer of SBFL techniques into test-input filtering, which could be extended if the mapping from coverage spectra to validator fitness is shown to be general rather than artifactual.

major comments (3)
  1. [Abstract and §5] Abstract and §5 (Evaluation): the repeated claim that GP-Ochiai validators are 'significantly more accurate' is unsupported by any statistical test, confidence interval, effect size, or raw-data summary. The reported accuracy advantage and the 4% flakiness-variation figure therefore remain descriptive rather than inferential, weakening the central comparative result.
  2. [§3.2] §3.2 (GP fitness function definition): the mapping from SBFL spectra (counts a, b, c, d) to a fitness function over candidate validator predicates or features is not explicitly constructed. Without this definition it is unclear why Ochiai, Tarantula, or Naish remain meaningful outside their original code-coverage setting or whether the reported superiority is an artifact of the particular feature encoding and labeling scheme.
  3. [§4 and §5] §4 and §5: the experimental protocol (number of GP runs, population size, termination criteria, how flakiness is injected and measured, train/test split for validator accuracy) is not fully specified, preventing independent reproduction or assessment of robustness claims.
minor comments (2)
  1. [Figures 4-7] Table captions and axis labels in the accuracy and robustness plots should explicitly state the number of independent runs and the exact accuracy metric (e.g., precision, recall, F1) used.
  2. [§5.3] The 88.7% alignment figure would benefit from a breakdown by domain and by type of violation (precondition vs. ODD vs. safe scenario) to show whether the result is uniform.
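
Major comment 1 asks for an effect size alongside any significance claim. As one illustration of what that involves, the sketch below computes Cohen's d over per-run accuracies of two techniques. Cohen's d is chosen here only as a common example; this page does not establish which effect size the paper itself reports. The function name `cohens_d` and the accuracy values are made up for illustration and are not results from the paper.

```python
import math
import statistics


def cohens_d(sample_a, sample_b):
    """Cohen's d using a pooled sample standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    if pooled_sd == 0:
        raise ValueError("both samples have zero variance; d is undefined")
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd


# Made-up per-run accuracies, for illustration only.
gp_ochiai_runs = [0.90, 0.92, 0.88, 0.91, 0.89]
decision_tree_runs = [0.84, 0.86, 0.83, 0.85, 0.82]
effect = cohens_d(gp_ochiai_runs, decision_tree_runs)
```

Cohen's d assumes roughly normal data. For accuracy distributions from randomized algorithms, a non-parametric effect size may be a better fit, so the choice should follow the statistical test actually used.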

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating where revisions will strengthen the manuscript and where we provide additional clarification.

  1. Referee: [Abstract and §5] the repeated claim that GP-Ochiai validators are 'significantly more accurate' is unsupported by any statistical test, confidence interval, effect size, or raw-data summary. The reported accuracy advantage and the 4% flakiness-variation figure therefore remain descriptive rather than inferential.

    Authors: We agree that the term 'significantly' was used in a descriptive sense in the current draft. In the revised version we will replace this with inferential statistics: we will report results from 30 independent GP runs, apply the Wilcoxon signed-rank test with p-values, compute Cohen's d effect sizes, and include 95% confidence intervals for the accuracy differences. Raw per-run accuracy tables will be added to an appendix or supplementary material. The 4% variation figure will similarly be accompanied by standard deviation and range across the four systems. revision: yes

  2. Referee: [§3.2] the mapping from SBFL spectra (counts a, b, c, d) to a fitness function over candidate validator predicates or features is not explicitly constructed. Without this definition it is unclear why Ochiai, Tarantula, or Naish remain meaningful outside their original code-coverage setting.

    Authors: The fitness function is obtained by treating each candidate validator predicate as a binary classifier over the set of executed test inputs: a = number of effective inputs where the predicate evaluates true, b = number of ineffective inputs where it evaluates true, c = number of effective inputs where it evaluates false, d = number of ineffective inputs where it evaluates false. The SBFL formula is then applied directly to these four counts to produce the fitness value. We will insert an explicit equation and a short paragraph in §3.2 that defines this mapping and explains why the formulas remain semantically meaningful when the 'spectrum' is derived from input-effectiveness labels rather than statement coverage. revision: yes

  3. Referee: [§4 and §5] the experimental protocol (number of GP runs, population size, termination criteria, how flakiness is injected and measured, train/test split for validator accuracy) is not fully specified, preventing independent reproduction or assessment of robustness claims.

    Authors: We acknowledge that several parameter values and procedural steps were described at a high level. In the revision we will expand both sections with the following concrete details: 30 independent GP runs per configuration, population size of 100, tournament selection, 100-generation limit or fitness convergence of 0.01, flakiness injection via Gaussian noise on simulator outputs with variance calibrated to observed real-world flakiness rates, accuracy measured on a held-out 30% test set after training on 70%, and explicit random-seed reporting. A new subsection will tabulate all hyperparameters. revision: yes
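
The count-based mapping described in response 2 can be sketched as follows, using the counts a, b, c, d exactly as that response defines them. Two points are assumptions rather than statements from the source: that effective inputs play the role that failing tests play in classic SBFL, and that "Naish" refers to the variant often called Op2 or Naish2. The function names are hypothetical.

```python
import math


def ochiai(a, b, c, d):
    """Ochiai score: a / sqrt((a + c) * (a + b)); 0.0 if the denominator is zero."""
    denominator = math.sqrt((a + c) * (a + b))
    return a / denominator if denominator else 0.0


def tarantula(a, b, c, d):
    """Tarantula score: ratio of the true-rate on effective inputs to the sum of both true-rates."""
    rate_effective = a / (a + c) if (a + c) else 0.0
    rate_ineffective = b / (b + d) if (b + d) else 0.0
    total = rate_effective + rate_ineffective
    return rate_effective / total if total else 0.0


def naish_op2(a, b, c, d):
    """Naish Op2 score (assumed variant): a - b / (b + d + 1)."""
    return a - b / (b + d + 1)


# Illustrative counts for one candidate predicate (made up):
# a = effective & true, b = ineffective & true,
# c = effective & false, d = ineffective & false.
counts = (8, 2, 2, 8)
scores = {
    "ochiai": ochiai(*counts),
    "tarantula": tarantula(*counts),
    "naish_op2": naish_op2(*counts),
}
```

The three formulas are on different scales (Ochiai and Tarantula lie in [0, 1], Op2 does not), so their raw values are comparable only as rankings of candidates within one formula, not across formulas.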

Circularity Check

0 steps flagged

No circularity in empirical evaluation of GP-based test validators

The paper's claims rest on comparative experiments across aerospace, networking, and autonomous-driving case studies, measuring validator accuracy against ground-truth labels for precondition/ODD/safe-scenario violations and checking robustness to simulator flakiness. SBFL formulas are adopted as GP fitness functions via an explicit methodological choice, with performance differences reported empirically rather than derived by construction from the evaluation data itself. Alignment with literature requirements (88.7% overlap) serves as an external validation step, not a definitional input. No equations, self-citations, or renamings reduce the reported accuracy advantages or robustness figures to tautological re-expressions of the same fitted quantities or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard domain assumptions about CPS testing and simulator behavior rather than new free parameters or invented entities.

axioms (2)
  • domain assumption CPS simulators can produce inconsistent outcomes for the same input due to flakiness, requiring repeated executions for reliable verdicts.
    Explicitly stated as a core motivation in the abstract.
  • domain assumption A substantial fraction of test inputs violate preconditions, exceed ODD limits, or represent inherently safe scenarios and can therefore be filtered without execution.
    Central premise justifying the use of validators.

pith-pipeline@v0.9.0 · 5860 in / 1478 out tokens · 41471 ms · 2026-05-18T20:51:04.896745+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Grammar-Constrained Refinement of Safety Operational Rules Using Language in the Loop: What Could Go Wrong

    cs.SE 2026-04 unverdicted novelty 5.0

    A grammar-constrained counterfactual refinement framework resolves inconsistencies in safety operational rules for an autonomous driving system while staying syntactically valid.

Reference graph

Works this paper leans on

110 extracted references · 110 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    [Online]

    (Accessed: September 2025) Raquel urtasun’s tech company develops self-driving vehicle simulator. [Online]. Available: https://www.thestar. com/business/raquel-urtasun-s-tech-company-develops-self-driving-v ehicle-simulator/article_4fc552f3-cbec-523c-ad3a-ec6aa93cdad7.html

  2. [2]

    Machine learning-based test selection for simulation-based testing of self-driving cars software,

    C. Birchler, S. Khatiri, B. Bosshard, A. Gambi, and S. Panichella, “Machine learning-based test selection for simulation-based testing of self-driving cars software,” Empirical Software Engineering , vol. 28, no. 3, p. 71, 2023

  3. [3]

    Salvo: Automated generation of diversified tests for self-driving cars from existing maps,

    V . Nguyen, S. Huber, and A. Gambi, “Salvo: Automated generation of diversified tests for self-driving cars from existing maps,” in2021 IEEE International Conference on Artificial Intelligence Testing (AITest) . IEEE, 2021, pp. 128–135

  4. [4]

    Simulation-based testing of unmanned aerial vehicles with aerialist,

    S. Khatiri, S. Panichella, and P. Tonella, “Simulation-based testing of unmanned aerial vehicles with aerialist,” in Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, 2024, pp. 134–138

  5. [5]

    An empirical analysis of flaky tests,

    Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, “An empirical analysis of flaky tests,” inProceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, 2014, pp. 643–653

  6. [6]

    A survey of flaky tests,

    O. Parry, G. M. Kapfhammer, M. Hilton, and P. McMinn, “A survey of flaky tests,” ACM Transactions on Software Engineering and Method- ology (TOSEM), vol. 31, no. 1, pp. 1–74, 2021

  7. [7]

    Constructing automated test oracle for low observable software,

    M. Valueian, N. Attar, H. Haghighi, and M. Vahidi-Asl, “Constructing automated test oracle for low observable software,” Scientia Iranica , vol. 27, no. 3, pp. 1333–1351, 2020

  8. [8]

    Using a neural network in the software testing process,

    M. Vanmali, M. Last, and A. Kandel, “Using a neural network in the software testing process,” International Journal of Intelligent Systems , vol. 17, no. 1, pp. 45–62, 2002

  9. [9]

    An automated framework for software test oracle,

    S. R. Shahamiri, W. M. N. W. Kadir, S. Ibrahim, and S. Z. M. Hashim, “An automated framework for software test oracle,” Information and Software Technology, vol. 53, no. 7, pp. 774–788, 2011. 23

  10. [10]

    Artificial neural networks as multi-networks automated test oracle,

    S. R. Shahamiri, W. M. Wan-Kadir, S. Ibrahim, and S. Z. M. Hashim, “Artificial neural networks as multi-networks automated test oracle,” Automated Software Engineering , vol. 19, pp. 303–334, 2012

  11. [11]

    An approach to design test oracle for aspect oriented software systems using soft computing approach,

    A. Singhal, A. Bansal, and A. Kumar, “An approach to design test oracle for aspect oriented software systems using soft computing approach,” International Journal of System Assurance Engineering and Management, vol. 7, pp. 1–5, 2016

  12. [12]

    A classifier-based test oracle for embedded software,

    F. Gholami, N. Attar, H. Haghighi, M. V . Asl, M. Valueian, and S. Mo- hamadyari, “A classifier-based test oracle for embedded software,” in 2018 Real-Time and Embedded Systems and Technologies (RTEST) . IEEE, 2018, pp. 104–111

  13. [13]

    A machine learning approach to generate test oracles,

    R. Braga, P. S. Neto, R. Rabêlo, J. Santiago, and M. Souza, “A machine learning approach to generate test oracles,” in Proceedings of the XXXII Brazilian Symposium on Software Engineering , 2018, pp. 142–151

  14. [14]

    Human-in-the-loop automatic program repair,

    C. Geethal, M. Böhme, and V .-T. Pham, “Human-in-the-loop automatic program repair,” IEEE Transactions on Software Engineering , 2023

  15. [15]

    On the accuracy of spectrum-based fault localization,

    R. Abreu, P. Zoeteweij, and A. J. Van Gemund, “On the accuracy of spectrum-based fault localization,” in Testing: Academic and industrial conference practice and research techniques-MUTATION (TAICPART- MUTATION 2007). IEEE, 2007, pp. 89–98

  16. [16]

    Empirical evaluation of the tarantula automatic fault-localization technique,

    J. A. Jones and M. J. Harrold, “Empirical evaluation of the tarantula automatic fault-localization technique,” in Proceedings of the 20th IEEE/ACM international Conference on Automated software engineer- ing, 2005, pp. 273–282

  17. [17]

    A model for spectra- based software diagnosis,

    L. Naish, H. J. Lee, and K. Ramamohanarao, “A model for spectra- based software diagnosis,” ACM Transactions on software engineering and methodology (TOSEM) , vol. 20, no. 3, pp. 1–32, 2011

  18. [18]

    Localizing multiple faults in simulink models,

    B. Liu, Lucia, S. Nejati, L. C. Briand, and T. Bruckmann, “Localizing multiple faults in simulink models,” in IEEE 23rd International Con- ference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016 - Volume 1 . IEEE Computer Society, 2016, pp. 146–156

  19. [19]

    Monitoring temporal properties of con- tinuous signals,

    O. Maler and D. Nickovic, “Monitoring temporal properties of con- tinuous signals,” in International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems. Springer, 2004, pp. 152–166

  20. [20]

    [Online]

    (Accessed: September 2025) Lockheed martin. [Online]. Available: https://www.lockheedmartin.com

  21. [21]

    Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,

    C. Menghi, S. Nejati, K. Gaaloul, and L. C. Briand, “Generating automated and online test oracles for simulink models with continuous and uncertain behaviors,” in Proceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering , 2019, pp. 27–38

  22. [22]

    [Online]

    (Accessed: September 2025) Cruise control test generation. [Online]. Available: https://www.mathworks.com/help/sldv/ug/cruise-control-tes t-generation.html

  23. [23]

    [Online]

    (Accessed: September 2025) Building a clutch lock-up model. [Online]. Available: https://www.mathworks.com/help/simulink/slref/ building-a-clutch-lock-up-model.html

  24. [24]

    [Online]

    (Accessed: September 2025) Design a guidance system in matlab and simulink. [Online]. Available: https://www.mathworks.com/help/simul ink/slref/designing-a-guidance-system-in-matlab-and-simulink.html

  25. [25]

    [Online]

    (Accessed: September 2025) Dc motor model simulink model. [Online]. Available: https://www.mathworks.com/matlabcentral/fileexc hange/11587-dc-motor-model-simulink

  26. [26]

    Arch- comp 2024 category report: Falsification,

    T. Khandait, F. Formica, P. Arcaini, S. Chotaliya, G. Fainekos, A. Hekal, A. Kundu, E. Lew, M. Loreti, C. Menghi et al. , “Arch- comp 2024 category report: Falsification,” in Proceedings of the 11th Int. Workshop on Applied , vol. 103, 2024, pp. 122–144

  27. [27]

    [Online]

    (Accessed: September 2025) Replication package for the article. [Online]. Available: https://doi.org/10.5281/zenodo.16912908

  28. [28]

    The oracle problem in software testing: A survey,

    E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,”Transactions on Software Engineering, vol. 41, no. 5, pp. 507–525, 2015

  29. [29]

    Luke, Essentials of Metaheuristics , 2nd ed

    S. Luke, Essentials of Metaheuristics , 2nd ed. Lulu, 2013, available for free at http://cs.gmu.edu/ ∼sean/book/metaheuristics/

  30. [30]

    Test generation strategies for building failure models and explaining spurious failures,

    B. A. Jodat, A. Chandar, S. Nejati, and M. Sabetzadeh, “Test generation strategies for building failure models and explaining spurious failures,” ACM Transactions on Software Engineering and Methodology, vol. 33, no. 4, pp. 1–32, 2024

  31. [31]

    Combining genetic programming and model checking to generate en- vironment assumptions,

    K. Gaaloul, C. Menghi, S. Nejati, L. C. Briand, and Y . I. Parache, “Combining genetic programming and model checking to generate en- vironment assumptions,” IEEE Transactions on Software Engineering , vol. 48, no. 9, pp. 3664–3685, 2021

  32. [32]

    Using genetic programming to build self-adaptivity into software-defined networks,

    J. Li, S. Nejati, and M. Sabetzadeh, “Using genetic programming to build self-adaptivity into software-defined networks,” ACM Transac- tions on Autonomous and Adaptive Systems , vol. 19, no. 1, pp. 1–35, 2024

  33. [33]

    Structure-based constants in genetic programming,

    C. B. Veenhuis, “Structure-based constants in genetic programming,” in Progress in Artificial Intelligence: 16th Portuguese Conference on Artificial Intelligence, EPIA 2013, Angra do Heroísmo, Azores, Portugal, September 9-12, 2013. Proceedings 16 . Springer, 2013, pp. 126–137

  34. [34]

    Harman, P

    M. Harman, P. McMinn, J. T. de Souza, and S. Yoo, Search Based Software Engineering: Techniques, Taxonomy, Tutorial. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 1–59, ISBN: 978-3-642-25231-0. [Online]. Available: https://doi.org/10.1007/978-3 -642-25231-0_1

  35. [35]

    R. Poli, W. B. Langdon, and N. F. McPhee, A Field Guide to Genetic Programming. Lulu.com, 2008, ISBN: 978-1-4092-0073-4

  36. [36]

    Evaluation of measures for statistical fault localisation and an optimising scheme,

    D. Landsberg, H. Chockler, D. Kroening, and M. Lewis, “Evaluation of measures for statistical fault localisation and an optimising scheme,” in Fundamental Approaches to Software Engineering: 18th International Conference, FASE 2015, Held as Part of the European Joint Confer- ences on Theory and Practice of Software, ETAPS 2015, London, UK, April 11-18, 20...

  37. [37]

    Molnar, Interpretable machine learning

    C. Molnar, Interpretable machine learning . Lulu. com, 2020, ISBN: 979-8411463330

  38. [38]

    Z3: An efficient smt solver,

    L. De Moura and N. Bjørner, “Z3: An efficient smt solver,” in International conference on Tools and Algorithms for the Construction and Analysis of Systems . Springer, 2008, pp. 337–340

  39. [39]

    Requirements-driven test generation for autonomous vehicles with machine learning components,

    C. E. Tuncali, G. Fainekos, D. Prokhorov, H. Ito, and J. Kapinski, “Requirements-driven test generation for autonomous vehicles with machine learning components,” IEEE Transactions on Intelligent Vehi- cles, vol. 5, no. 2, pp. 265–280, 2019

  40. [40]

    Pareto efficient multi-objective black-box test case selection for simulation-based testing,

    A. Arrieta, S. Wang, U. Markiegi, A. Arruabarrena, L. Etxeberria, and G. Sagardui, “Pareto efficient multi-objective black-box test case selection for simulation-based testing,” Information and Software Tech- nology, 2019

  41. [41]

    Mining assumptions for software components using machine learning,

    K. Gaaloul, C. Menghi, S. Nejati, L. C. Briand, and D. Wolfe, “Mining assumptions for software components using machine learning,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Soft- ware Engineering, 2020, pp. 159–171

  42. [42]

    Arch-comp 2019 category report: Falsification

    G. Ernst, P. Arcaini, A. Donze, G. Fainekos, L. Mathesen, G. Pedrielli, S. Yaghoubi, Y . Yamagata, and Z. Zhang, “Arch-comp 2019 category report: Falsification.” in ARCH@ CPSIoTWeek, 2019, pp. 129–140

  43. [43]

    Learning non- robustness using simulation-based testing: a network traffic-shaping case study,

    B. A. Jodat, S. Nejati, M. Sabetzadeh, and P. Saavedra, “Learning non- robustness using simulation-based testing: a network traffic-shaping case study,” in 2023 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2023, pp. 386–397

  44. [44]

    D. K. Chaturvedi, Modeling and simulation of systems using MAT- LAB® and Simulink® . CRC press, 2017, ISBN: 978-1439806722

  45. [45]

    [Online]

    (Accessed: September 2025) Navigating the do-178c certification process for airborne software. [Online]. Available: https://thecloudstra p.com/navigating-the-do-178c-certification-process/

  46. [46]

    [Online]

    (Accessed: September 2025) Autopilot online benchmark. [Online]. Available: https://www.mathworks.com/matlabcentral/fileexchange/41 490-autopilot-demo-for-arp4754a-do-178c-and-do-331

  47. [47]

    [Online]

    (Accessed: September 2025) Beamng.tech. [Online]. Available: https://beamng.tech

  48. [48]

    Control strategies for autonomous vehicles,

    C. V . Samak, T. V . Samak, and S. Kandhasamy, “Control strategies for autonomous vehicles,” in Autonomous driving and advanced driver- assistance systems (ADAS) . CRC Press, 2021, pp. 37–86

  49. [49]

    End to End Learning for Self-Driving Cars

    M. Bojarski, D. W. del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self- driving cars,” ArXiv, vol. abs/1604.07316, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:15780954

  50. [50]

    [Online]

    (Accessed: September 2025) Github repo for cyber-physical systems testing tool competition. [Online]. Available: https://github.com/sbft-c ps-tool-competition/cps-tool-competition

  51. [51]

    Evaluating the impact of flaky simulators on testing autonomous driving systems,

    M. H. Amini, S. Naseri, and S. Nejati, “Evaluating the impact of flaky simulators on testing autonomous driving systems,” Empirical Software Engineering, vol. 29, no. 2, pp. 1–30, 2024

  52. [52]

    Digital twins are not monozygotic - cross-replicating ADAS testing in two industry-grade automotive simulators,

    M. Borg, R. B. Abdessalem, S. Nejati, F. Jegeden, and D. Shin, “Digital twins are not monozygotic - cross-replicating ADAS testing in two industry-grade automotive simulators,” in 14th IEEE Conference on Software Testing, Verification and Validation, ICST 2021, Porto de Galinhas, Brazil, April 12-16, 2021 . IEEE, 2021, pp. 383–393

  53. [53]

    Practical bayesian optimization of machine learning algorithms,

    J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” Advances in neural information processing systems, vol. 25, 2012

  54. [54]

    A comparison of bloat control methods for genetic programming,

    S. Luke and L. Panait, “A comparison of bloat control methods for genetic programming,” Evolutionary computation, vol. 14, no. 3, pp. 309–344, 2006

  55. [55]

    On a test of whether one of two random variables is stochastically larger than the other,

    H. B. Mann and D. R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other,” The annals of mathematical statistics, pp. 50–60, 1947

  56. [56]

    A critique and improvement of the cl common language effect size statistics of mcgraw and wong,

    A. Vargha and H. D. Delaney, “A critique and improvement of the cl common language effect size statistics of mcgraw and wong,” Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000

  57. [57]

    Controlling the false discovery rate: a practical and powerful approach to multiple testing,

    Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal statistical society: series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995

  58. [58]

    Simulation-based test case generation for unmanned aerial vehicles in the neighborhood of real flights,

    S. Khatiri, S. Panichella, and P. Tonella, “Simulation-based test case generation for unmanned aerial vehicles in the neighborhood of real flights,” in 2023 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2023, pp. 281–292

  59. [59]

    A practical guide for using statistical tests to assess randomized algorithms in software engineering,

    A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” in Proceedings of the 33rd international conference on software engineering, 2011, pp. 1–10

  60. [60]

    Automated formalization of structured natural language requirements,

    D. Giannakopoulou, T. Pressburger, A. Mavridou, and J. Schumann, “Automated formalization of structured natural language requirements,” Information and Software Technology, vol. 137, p. 106590, 2021

  61. [61]

    Evaluating model testing and model checking for finding requirements violations in simulink models,

    S. Nejati, K. Gaaloul, C. Menghi, L. C. Briand, S. Foster, and D. Wolfe, “Evaluating model testing and model checking for finding requirements violations in simulink models,” in Proceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering, 2019, pp. 1015–1025

  62. [62]

    [Online]

    (Accessed: September 2025) End-to-end deep learning for self-driving cars. [Online]. Available: https://developer.nvidia.com/blog/deep-learning-self-driving-cars/

  63. [63]

    The daikon system for dynamic detection of likely invariants,

    M. D. Ernst, J. H. Perkins, P. J. Guo, S. McCamant, C. Pacheco, M. S. Tschantz, and C. Xiao, “The daikon system for dynamic detection of likely invariants,” Science of computer programming, vol. 69, no. 1-3, pp. 35–45, 2007

  64. [64]

    Test oracle assessment and improvement,

    G. Jahangirova, D. Clark, M. Harman, and P. Tonella, “Test oracle assessment and improvement,” in Proceedings of the 25th international symposium on software testing and analysis, 2016, pp. 247–258

  65. [65]

    Evolutionary improvement of assertion oracles,

    V. Terragni, G. Jahangirova, P. Tonella, and M. Pezzè, “Evolutionary improvement of assertion oracles,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1178–1189

  66. [66]

    Using semi-supervised learning for predicting metamorphic relations,

    B. Hardin and U. Kanewala, “Using semi-supervised learning for predicting metamorphic relations,” in Proceedings of the 3rd International Workshop on Metamorphic Testing, 2018, pp. 14–17

  67. [67]

    Using machine learning techniques to detect metamorphic relations for programs without test oracles,

    U. Kanewala and J. M. Bieman, “Using machine learning techniques to detect metamorphic relations for programs without test oracles,” in 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2013, pp. 1–10

  68. [68]

    Predicting metamorphic relations for testing scientific software: a machine learning approach using graph kernels,

    U. Kanewala, J. M. Bieman, and A. Ben-Hur, “Predicting metamorphic relations for testing scientific software: a machine learning approach using graph kernels,” Software testing, verification and reliability, vol. 26, no. 3, pp. 245–269, 2016

  69. [69]

    Leveraging mutants for automatic prediction of metamorphic relations using machine learning,

    A. Nair, K. Meinke, and S. Eldh, “Leveraging mutants for automatic prediction of metamorphic relations using machine learning,” in Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation, 2019, pp. 1–6

  70. [70]

    Rbf-mlmr: A multi-label metamorphic relation prediction approach using rbf neural network,

    P. Zhang, X. Zhou, P. Pelliccione, and H. Leung, “Rbf-mlmr: A multi-label metamorphic relation prediction approach using rbf neural network,” IEEE access, vol. 5, pp. 21 791–21 805, 2017

  71. [71]

    Genmorph: Automatically generating metamorphic relations via genetic programming,

    J. Ayerdi, V. Terragni, G. Jahangirova, A. Arrieta, and P. Tonella, “Genmorph: Automatically generating metamorphic relations via genetic programming,” IEEE Transactions on Software Engineering, 2024

  72. [72]

    Exploratory test oracle using multi-layer perceptron neural network,

    W. Makondo, R. Nallanthighal, I. Mapanga, and P. Kadebu, “Exploratory test oracle using multi-layer perceptron neural network,” in 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 2016, pp. 1166–1171

  73. [73]

    An automated oracle approach to test decision-making structures,

    S. R. Shahamiri, W. M. N. W. Kadir, and S. bin Ibrahim, “An automated oracle approach to test decision-making structures,” in 2010 3rd International Conference on Computer Science and Information Technology, vol. 5. IEEE, 2010, pp. 30–34

  74. [74]

    A neural net based approach to test oracle,

    K. Aggarwal, Y. Singh, A. Kaur, and O. Sangwan, “A neural net based approach to test oracle,” ACM SIGSOFT Software Engineering Notes, vol. 29, no. 3, pp. 1–6, 2004

  75. [75]

    Artificial neural network for automatic test oracles generation,

    H. Jin, Y. Wang, N.-W. Chen, Z.-J. Gou, and S. Wang, “Artificial neural network for automatic test oracles generation,” in 2008 International Conference on Computer Science and Software Engineering, vol. 2. IEEE, 2008, pp. 727–730

  76. [76]

    Performing software test oracle based on deep neural network with fuzzy inference system,

    A. K. Monsefi, B. Zakeri, S. Samsam, and M. Khashehchi, “Performing software test oracle based on deep neural network with fuzzy inference system,” in High-Performance Computing and Big Data Analysis: Second International Congress, TopHPC 2019, Tehran, Iran, April 23–25, 2019, Revised Selected Papers 2. Springer, 2019, pp. 406–417

  77. [77]

    Radial basis function neural network based approach to test oracle,

    O. P. Sangwan, P. K. Bhatia, and Y. Singh, “Radial basis function neural network based approach to test oracle,” ACM SIGSOFT Software Engineering Notes, vol. 36, no. 5, pp. 1–5, 2011

  78. [78]

    Automated test oracle based on neural networks,

    M. Ye, B. Feng, L. Zhu, and Y. Lin, “Automated test oracle based on neural networks,” in 2006 5th IEEE International Conference on Cognitive Informatics, vol. 1. IEEE, 2006, pp. 517–522

  79. [79]

    Automatic test oracle based on probabilistic neural networks,

    R. Zhang, Y.-w. Wang, and M.-z. Zhang, “Automatic test oracle based on probabilistic neural networks,” in Recent Developments in Intelligent Computing, Communication and Devices: Proceedings of ICCD 2017. Springer, 2019, pp. 437–445

  80. [80]

    Generating metamorphic relations for cyber-physical systems with genetic programming: an industrial case study,

    J. Ayerdi, V. Terragni, A. Arrieta, P. Tonella, G. Sagardui, and M. Arratibel, “Generating metamorphic relations for cyber-physical systems with genetic programming: an industrial case study,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1264–1274
