Towards Safety-Aware Mutation Testing for Autonomous Driving Systems
Pith reviewed 2026-06-26 00:55 UTC · model grok-4.3
The pith
Safety-Aware Mutation Testing injects STPA-derived faults into ADS module messages to measure test adequacy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that mutant generation rules derived directly from top-down safety engineering frameworks such as STPA, combined with systematic injection of temporally bounded faults into the messages exchanged between ADS modules, produce mutants that represent genuine hazards. According to the paper, this embeds systems thinking into the mutation testing pipeline and supports three uses: evaluating test adequacy, enabling automated scenario generation, and guiding ADS repair.
What carries the argument
Safety-Aware Mutation Testing (SAMT), which generates mutants by injecting temporally bounded faults into inter-module messages using rules taken from STPA safety analysis.
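The paper gives no concrete injection mechanism, so the following is only a minimal sketch of what a temporally bounded message fault could look like. The `Message` fields, the topic name, and the `TemporalFault` and `inject` helpers are all hypothetical and do not come from the paper or from any ADS framework.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Message:
    """Hypothetical inter-module message (e.g. perception to planning)."""
    timestamp: float          # seconds since scenario start
    topic: str                # hypothetical topic name
    obstacle_distance: float  # metres; illustrative payload field


@dataclass(frozen=True)
class TemporalFault:
    """A fault that is active only inside [start, start + duration)."""
    topic: str
    start: float
    duration: float
    mutate: Callable[[Message], Message]

    def active(self, msg: Message) -> bool:
        in_window = self.start <= msg.timestamp < self.start + self.duration
        return msg.topic == self.topic and in_window


def inject(stream: Iterable[Message], fault: TemporalFault) -> Iterator[Message]:
    """Yield the stream with the fault applied only inside its window."""
    for msg in stream:
        yield fault.mutate(msg) if fault.active(msg) else msg


# Six messages at 0.0, 0.5, ..., 2.5 s, all reporting an obstacle at 20 m.
stream = [Message(t * 0.5, "/perception/obstacles", 20.0) for t in range(6)]

# Assumed fault: overestimate the obstacle distance for one second.
fault = TemporalFault(
    topic="/perception/obstacles",
    start=1.0,
    duration=1.0,
    mutate=lambda m: replace(m, obstacle_distance=m.obstacle_distance * 2),
)

mutated = list(inject(stream, fault))
```

Only the messages stamped 1.0 s and 1.5 s are altered; everything outside the window passes through unchanged, which is what separates this kind of mutant from a permanent change to a module's code.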
If this is right
- Test adequacy for ADS becomes measurable by the fraction of STPA-derived message mutants killed by a given set of scenarios.
- Surviving mutants directly indicate which component interactions still require additional test scenarios.
- Automated scenario generation can target the specific message faults that remain unkilled.
- Repair efforts can focus on the interactions whose mutants are hardest to kill.
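These consequences can be sketched with a small kill matrix. The mutant and scenario names below are invented for illustration, and the score is the plain killed-over-total ratio, which assumes no equivalent mutants need to be excluded.

```python
from typing import Dict, List, Set

# Hypothetical kill matrix: mutant name -> scenarios that killed it.
kill_matrix: Dict[str, Set[str]] = {
    "drop_brake_cmd": {"s1", "s3"},
    "delay_obstacle_msg": {"s2"},
    "stale_localization": set(),
    "corrupt_trajectory": set(),
}


def mutation_score(matrix: Dict[str, Set[str]]) -> float:
    """Fraction of mutants killed by at least one scenario."""
    if not matrix:
        raise ValueError("kill matrix is empty")
    killed = sum(1 for scenarios in matrix.values() if scenarios)
    return killed / len(matrix)


def surviving(matrix: Dict[str, Set[str]]) -> List[str]:
    """Mutants no scenario killed: candidates for new test scenarios."""
    return sorted(m for m, scenarios in matrix.items() if not scenarios)


def hardest_killed(matrix: Dict[str, Set[str]]) -> List[str]:
    """Killed mutants ordered by how few scenarios kill them."""
    killed = [m for m, scenarios in matrix.items() if scenarios]
    return sorted(killed, key=lambda m: (len(matrix[m]), m))


score = mutation_score(kill_matrix)
```

Here `surviving` would feed scenario generation and `hardest_killed` would be one possible way to prioritize repair effort; the paper does not specify either ranking.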
Where Pith is reading between the lines
- The same message-fault approach could be tried in other cyber-physical systems where accidents arise from component interactions rather than single-module errors.
- Practical use would require mapping STPA-derived rules to the concrete message formats and timing constraints of existing ADS simulators.
- If the method scales, it could reduce reliance on exhaustive scenario enumeration by providing a stopping rule tied to hazard coverage.
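One possible shape for the STPA-to-message mapping is sketched below. The four unsafe control action categories are the standard STPA ones; the choice of operator for each category, the `(timestamp, value)` message format, and all function names are assumptions made for illustration, not rules given in the paper.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple

Msg = Tuple[float, float]  # (timestamp in s, commanded value); hypothetical format
Operator = Callable[[List[Msg], float, float], List[Msg]]


class UCAType(Enum):
    NOT_PROVIDED = "not provided"
    PROVIDED_UNSAFE = "provided, causing a hazard"
    WRONG_TIMING = "too early, too late, or out of order"
    WRONG_DURATION = "stopped too soon or applied too long"


def in_window(t: float, start: float, end: float) -> bool:
    return start <= t < end


def drop(msgs: List[Msg], start: float, end: float) -> List[Msg]:
    """Remove every message stamped inside the window."""
    return [(t, v) for t, v in msgs if not in_window(t, start, end)]


def corrupt(msgs: List[Msg], start: float, end: float) -> List[Msg]:
    """Flip the sign of the value inside the window (illustrative corruption)."""
    return [(t, -v) if in_window(t, start, end) else (t, v) for t, v in msgs]


def delay(msgs: List[Msg], start: float, end: float) -> List[Msg]:
    """Shift in-window messages later by the window length."""
    lag = end - start
    shifted = [(t + lag, v) if in_window(t, start, end) else (t, v)
               for t, v in msgs]
    return sorted(shifted, key=lambda m: m[0])  # stable sort by timestamp


def hold(msgs: List[Msg], start: float, end: float) -> List[Msg]:
    """Keep repeating the last pre-window value for the whole window."""
    out: List[Msg] = []
    last = None
    for t, v in msgs:
        if in_window(t, start, end) and last is not None:
            out.append((t, last))
        else:
            out.append((t, v))
            last = v
    return out


# Assumed mapping from UCA category to message mutation operator.
RULES: Dict[UCAType, Operator] = {
    UCAType.NOT_PROVIDED: drop,
    UCAType.PROVIDED_UNSAFE: corrupt,
    UCAType.WRONG_TIMING: delay,
    UCAType.WRONG_DURATION: hold,
}

msgs: List[Msg] = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
mutants = {uca: op(msgs, 1.0, 2.0) for uca, op in RULES.items()}
```

A real mapping would still have to bind each operator to a specific control action, message type, and timing constraint in the target simulator, which is the open step this bullet list points to.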
Load-bearing premise
The premise is twofold: that rules taken from top-down safety frameworks such as STPA produce faults that represent genuine hazards, and that injecting short-lived faults into messages simulates realistic interaction failures.
What would settle it
A comparison showing that SAMT-generated mutants trigger failure modes different from those recorded in real ADS accident reports would challenge the claim.
Original abstract
Simulation-based testing is essential for ensuring the safety of Autonomous Driving Systems (ADS), yet the community lacks a systematic criterion for determining when we can safely stop additional test scenario generation. Existing coverage metrics typically focus on individual component reliability or treat the ADS as a black box, failing to capture certain component interactions that cause most ADS accidents. While traditional mutation testing provides a falsifiable measure of test adequacy, directly porting code- and deep learning model-level mutations to the corresponding modules of ADS is insufficient. In this vision paper, we propose a paradigm shift toward Safety-Aware Mutation Testing (SAMT). Unlike traditional mutation testing, which creates mutants (i.e., faulty versions of the software under test) by injecting artificial faults into individual components, SAMT systematically injects temporally bounded faults into the messages exchanged between ADS modules to simulate realistic interaction failures. To ensure these mutants represent genuine hazards, we propose deriving mutant generation rules directly from top-down safety engineering frameworks, such as System-Theoretic Process Analysis (STPA). By embedding systems thinking into the mutation testing pipeline, SAMT provides a rigorous mechanism for evaluating test adequacy, enabling automated scenario generation, and guiding ADS repair. We also outline critical open challenges.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a vision paper proposing Safety-Aware Mutation Testing (SAMT) for Autonomous Driving Systems. It argues that simulation-based testing lacks a systematic stopping criterion and that existing coverage metrics and traditional mutation testing (focused on individual components or black-box behavior) fail to capture the component interactions responsible for most ADS accidents. SAMT instead systematically injects temporally bounded faults into inter-module messages, with mutant generation rules derived from top-down safety engineering frameworks such as STPA, to simulate realistic interaction failures. The paper claims this embeds systems thinking into the mutation testing pipeline and thereby supplies a rigorous mechanism for evaluating test adequacy, enabling automated scenario generation, and guiding ADS repair, while also outlining open challenges.
Significance. If the proposed mapping from STPA to concrete, validated message mutations can be developed and shown to produce faults that represent genuine hazards, SAMT could meaningfully advance safety testing for autonomous systems by targeting interaction failures that current component-level approaches miss. This would address a recognized gap in ADS verification and potentially improve both test adequacy assessment and repair guidance.
major comments (2)
- [Abstract] Abstract (SAMT proposal paragraph): The claim that 'deriving mutant generation rules directly from ... STPA' ensures mutants 'represent genuine hazards' and that SAMT thereby 'provides a rigorous mechanism' is load-bearing for the central contribution, yet the manuscript supplies no STPA-to-mutation derivation procedure, no sample control structure or unsafe control action, and no concrete translation to a message mutation. Without such an example the assumption that the resulting faults capture genuine hazards remains unillustrated and untestable.
- [Abstract] Abstract (SAMT proposal paragraph): The paper introduces 'temporally bounded faults' into messages as the core mechanism for simulating interaction failures but provides neither a definition of temporal bounding nor an argument or illustration showing why such faults are more realistic or more effective than component-level mutations at exposing the interaction issues that cause accidents. This absence directly undermines the superiority claim over traditional mutation testing.
minor comments (1)
- The open challenges section is mentioned only in passing; expanding it with concrete research questions (e.g., how to automate the STPA-to-mutation step) would strengthen the vision paper.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our vision paper. We agree that the abstract claims would be strengthened by concrete illustrations of the proposed mapping and definitions, and we will revise the manuscript to incorporate these elements.
Point-by-point responses
- Referee: [Abstract] Abstract (SAMT proposal paragraph): The claim that 'deriving mutant generation rules directly from ... STPA' ensures mutants 'represent genuine hazards' and that SAMT thereby 'provides a rigorous mechanism' is load-bearing for the central contribution, yet the manuscript supplies no STPA-to-mutation derivation procedure, no sample control structure or unsafe control action, and no concrete translation to a message mutation. Without such an example the assumption that the resulting faults capture genuine hazards remains unillustrated and untestable.
  Authors: We acknowledge that the manuscript does not supply a detailed STPA-to-mutation derivation procedure or concrete example. As this is a vision paper proposing a new paradigm, the emphasis is on the high-level approach rather than a fully worked implementation. To address the concern, we will add an illustrative example in the revised manuscript, including a sample control structure, an unsafe control action, and its translation to a specific message mutation. This will make the proposal more concrete without altering the vision-oriented nature of the work.
  Revision: yes
- Referee: [Abstract] Abstract (SAMT proposal paragraph): The paper introduces 'temporally bounded faults' into messages as the core mechanism for simulating interaction failures but provides neither a definition of temporal bounding nor an argument or illustration showing why such faults are more realistic or more effective than component-level mutations at exposing the interaction issues that cause accidents. This absence directly undermines the superiority claim over traditional mutation testing.
  Authors: We agree that the absence of a definition for 'temporally bounded faults' and supporting argument leaves the distinction from traditional mutation testing insufficiently clear. In the revision we will add a precise definition (faults that affect message content or timing only within a bounded temporal window) together with a short illustrative comparison showing how this targets inter-module interaction failures more directly than component-level mutations.
  Revision: yes
Circularity Check
No circularity; vision paper with no derivations or self-referential steps
The paper is a conceptual vision proposal for SAMT that suggests deriving mutant rules from STPA but supplies no equations, parameter fits, uniqueness theorems, or derivation chains. The abstract and description state the proposal without reducing any claim to its own inputs by construction, self-citation, or renaming. No load-bearing steps match the enumerated circularity patterns; the work is self-contained as an outline of open challenges rather than a closed derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: STPA-derived rules will produce mutants that represent genuine hazards for ADS interaction failures
invented entities (1)
- Safety-Aware Mutation Testing (SAMT): no independent evidence