Towards Safety-Aware Mutation Testing for Autonomous Driving Systems

Donghwan Shin

arxiv: 2606.26456 · v1 · pith:JPJRV2V4new · submitted 2026-06-24 · 💻 cs.SE

Pith reviewed 2026-06-26 00:55 UTC · model grok-4.3

keywords: autonomous driving systems, mutation testing, safety analysis, STPA, test adequacy, simulation-based testing, system safety, interaction faults

The pith

Safety-Aware Mutation Testing injects STPA-derived faults into ADS module messages to measure test adequacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Safety-Aware Mutation Testing (SAMT) to give a clear stopping criterion for generating test scenarios in autonomous driving systems (ADS). Existing coverage metrics either check single components or treat the whole system as a black box, so they miss the component interactions that cause most ADS accidents. SAMT creates mutants by inserting short-lived faults into the messages that pass between modules, with the fault patterns taken from a top-down safety method such as STPA. A sympathetic reader would care because this extends mutation testing's falsifiable adequacy measure to module interactions, which code-level and model-level mutations do not reach. If the method holds, testing can stop when the generated mutants are killed, new scenarios can be produced automatically, and repairs can be guided by the surviving mutants.

Core claim

The paper claims that deriving mutant generation rules directly from top-down safety engineering frameworks such as STPA and then systematically injecting temporally bounded faults into the messages exchanged between ADS modules produces mutants that represent genuine hazards, thereby embedding systems thinking into the mutation testing pipeline to evaluate test adequacy, enable automated scenario generation, and guide ADS repair.

What carries the argument

Safety-Aware Mutation Testing (SAMT), which generates mutants by injecting temporally bounded faults into inter-module messages using rules taken from STPA safety analysis.

If this is right

Test adequacy for ADS becomes measurable by the fraction of STPA-derived message mutants killed by a given set of scenarios.
Surviving mutants directly indicate which component interactions still require additional test scenarios.
Automated scenario generation can target the specific message faults that remain unkilled.
Repair efforts can focus on the interactions whose mutants are hardest to kill.
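
The first two consequences above amount to a mutation score over message mutants plus a list of survivors. The following is a minimal sketch of that bookkeeping, not something the paper provides: the `MutantResult` record, the mutant identifiers, and the scenario names are all hypothetical, and "killed" is assumed to mean that at least one scenario's safety oracle flagged the mutant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutantResult:
    """Outcome of running a scenario suite against one message mutant (hypothetical record)."""
    mutant_id: str    # illustrative label such as "planning->control/delay"
    killed_by: tuple  # names of scenarios whose safety oracle flagged this mutant


def mutation_score(results):
    """Fraction of mutants killed by at least one scenario (0.0 for an empty list)."""
    if not results:
        return 0.0
    killed = sum(1 for r in results if r.killed_by)
    return killed / len(results)


def surviving(results):
    """Mutants that no scenario killed: interactions that still lack a revealing scenario."""
    return [r.mutant_id for r in results if not r.killed_by]


# Made-up example data, only to show the shape of the computation.
results = [
    MutantResult("perception->planning/drop", ("cut_in",)),
    MutantResult("planning->control/delay", ()),
    MutantResult("planning->control/corrupt", ("cut_in", "hard_brake")),
    MutantResult("localization->planning/stale", ()),
]

print(mutation_score(results))  # 0.5
print(surviving(results))       # ['planning->control/delay', 'localization->planning/stale']
```

Under these assumptions, a stopping rule would be a threshold on `mutation_score`, and `surviving` would name the targets for further scenario generation. How equivalent mutants (ones no scenario can kill) should be discounted is left open here.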

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same message-fault approach could be tried in other cyber-physical systems where accidents arise from component interactions rather than single-module errors.
Practical use would require mapping STPA-derived rules to the concrete message formats and timing constraints of existing ADS simulators.
If the method scales, it could reduce reliance on exhaustive scenario enumeration by providing a stopping rule tied to hazard coverage.
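
One way to picture the mapping step mentioned above is a lookup from STPA's four standard categories of unsafe control action to message-level fault operators. This is a hedged sketch, not the paper's procedure: the operator names (`drop`, `corrupt_value`, `delay`, `reorder`, `truncate`, `repeat_last`), the channel label, and the time windows are assumptions chosen for illustration.

```python
from enum import Enum
from itertools import product


class UCAType(Enum):
    """The four unsafe-control-action categories used in STPA."""
    NOT_PROVIDED = "not provided"
    PROVIDED_UNSAFE = "provided, causing a hazard"
    WRONG_TIMING = "too early, too late, or out of order"
    WRONG_DURATION = "stopped too soon or applied too long"


# Hypothetical mapping from UCA category to message fault operators.
OPERATORS = {
    UCAType.NOT_PROVIDED: ["drop"],
    UCAType.PROVIDED_UNSAFE: ["corrupt_value"],
    UCAType.WRONG_TIMING: ["delay", "reorder"],
    UCAType.WRONG_DURATION: ["truncate", "repeat_last"],
}


def enumerate_mutants(ucas, windows):
    """Build one mutant per (unsafe control action, time window, operator) combination.

    ucas:    list of (channel, UCAType) pairs taken from a safety analysis
    windows: list of (start_s, end_s) intervals during which the fault is active
    """
    mutants = []
    for (channel, uca_type), window in product(ucas, windows):
        for operator in OPERATORS[uca_type]:
            mutants.append({"channel": channel, "operator": operator, "window": window})
    return mutants


# Illustrative inputs: two unsafe control actions on one channel, two fault windows.
ucas = [
    ("planning->control", UCAType.NOT_PROVIDED),
    ("planning->control", UCAType.WRONG_TIMING),
]
windows = [(2.0, 2.5), (5.0, 5.5)]

mutants = enumerate_mutants(ucas, windows)
print(len(mutants))  # 6: (1 + 2 operators) x 2 windows
```

The sketch makes the cost visible: the mutant count grows with the product of unsafe control actions, windows, and operators, so window selection would need to be principled in practice.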

Load-bearing premise

Rules taken from top-down safety frameworks such as STPA will produce faults that represent genuine hazards and that injecting short-lived faults into messages will simulate realistic interaction failures.

What would settle it

A comparison showing that SAMT-generated mutants trigger failure modes different from those recorded in real ADS accident reports would challenge the claim.

Original abstract

Simulation-based testing is essential for ensuring the safety of Autonomous Driving Systems (ADS), yet the community lacks a systematic criterion for determining when we can safely stop additional test scenario generation. Existing coverage metrics typically focus on individual component reliability or treat the ADS as a black box, failing to capture certain component interactions that cause most ADS accidents. While traditional mutation testing provides a falsifiable measure of test adequacy, directly porting code- and deep learning model-level mutations to the corresponding modules of ADS is insufficient. In this vision paper, we propose a paradigm shift toward Safety-Aware Mutation Testing (SAMT). Unlike traditional mutation testing, which creates mutants (i.e., faulty versions of the software under test) by injecting artificial faults into individual components, SAMT systematically injects temporally bounded faults into the messages exchanged between ADS modules to simulate realistic interaction failures. To ensure these mutants represent genuine hazards, we propose deriving mutant generation rules directly from top-down safety engineering frameworks, such as System-Theoretic Process Analysis (STPA). By embedding systems thinking into the mutation testing pipeline, SAMT provides a rigorous mechanism for evaluating test adequacy, enabling automated scenario generation, and guiding ADS repair. We also outline critical open challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Vision paper proposes STPA-guided message mutations for ADS but gives no rules, examples, or evidence that they capture real hazards.

The paper's core idea is to move mutation testing for autonomous driving systems from component faults to message exchanges between modules, with the mutation rules coming from STPA safety analysis. This specific combination is not in the prior work the abstract cites.

It correctly flags that interaction failures drive most ADS accidents and that standard coverage metrics miss them. Framing mutation testing around systems-level hazards is a reasonable direction to explore.

The problem is that the paper stops at the proposal. It states that mutant rules should be derived from STPA unsafe control actions and that faults should be temporally bounded, but it supplies no mapping procedure, no sample control structure, no example mutation, and no definition of temporal bounding. The assertion that this approach will deliver a rigorous adequacy measure or guide repair therefore rests on the untested assumption that STPA-derived message faults will behave like genuine interaction failures.

Because the work is explicitly a vision paper, the absence of experiments or artifacts is expected. Still, the central claim cannot be evaluated without at least one worked example showing how an STPA output becomes a concrete mutant.

This is for researchers already focused on ADS verification and safety engineering who want to consider systems-theoretic testing ideas. It could prompt useful discussion in a reading group about how to make the rules operational.

I would not cite it in its current form. It is worth sending to peer review so the authors can receive concrete feedback on turning the high-level framing into usable operators.

Referee Report

2 major / 1 minor

Summary. The paper is a vision paper proposing Safety-Aware Mutation Testing (SAMT) for Autonomous Driving Systems. It argues that simulation-based testing lacks a systematic stopping criterion and that existing coverage metrics and traditional mutation testing (focused on individual components or black-box behavior) fail to capture the component interactions responsible for most ADS accidents. SAMT instead systematically injects temporally bounded faults into inter-module messages, with mutant generation rules derived from top-down safety engineering frameworks such as STPA, to simulate realistic interaction failures. The paper claims this embeds systems thinking into the mutation testing pipeline and thereby supplies a rigorous mechanism for evaluating test adequacy, enabling automated scenario generation, and guiding ADS repair, while also outlining open challenges.

Significance. If the proposed mapping from STPA to concrete, validated message mutations can be developed and shown to produce faults that represent genuine hazards, SAMT could meaningfully advance safety testing for autonomous systems by targeting interaction failures that current component-level approaches miss. This would address a recognized gap in ADS verification and potentially improve both test adequacy assessment and repair guidance.

major comments (2)

[Abstract] Abstract (SAMT proposal paragraph): The claim that 'deriving mutant generation rules directly from ... STPA' ensures mutants 'represent genuine hazards' and that SAMT thereby 'provides a rigorous mechanism' is load-bearing for the central contribution, yet the manuscript supplies no STPA-to-mutation derivation procedure, no sample control structure or unsafe control action, and no concrete translation to a message mutation. Without such an example the assumption that the resulting faults capture genuine hazards remains unillustrated and untestable.
[Abstract] Abstract (SAMT proposal paragraph): The paper introduces 'temporally bounded faults' into messages as the core mechanism for simulating interaction failures but provides neither a definition of temporal bounding nor an argument or illustration showing why such faults are more realistic or more effective than component-level mutations at exposing the interaction issues that cause accidents. This absence directly undermines the superiority claim over traditional mutation testing.

minor comments (1)

The open challenges section is mentioned only in passing; expanding it with concrete research questions (e.g., how to automate the STPA-to-mutation step) would strengthen the vision paper.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our vision paper. We agree that the abstract claims would be strengthened by concrete illustrations of the proposed mapping and definitions, and we will revise the manuscript to incorporate these elements.

Referee: [Abstract] Abstract (SAMT proposal paragraph): The claim that 'deriving mutant generation rules directly from ... STPA' ensures mutants 'represent genuine hazards' and that SAMT thereby 'provides a rigorous mechanism' is load-bearing for the central contribution, yet the manuscript supplies no STPA-to-mutation derivation procedure, no sample control structure or unsafe control action, and no concrete translation to a message mutation. Without such an example the assumption that the resulting faults capture genuine hazards remains unillustrated and untestable.

Authors: We acknowledge that the manuscript does not supply a detailed STPA-to-mutation derivation procedure or concrete example. As this is a vision paper proposing a new paradigm, the emphasis is on the high-level approach rather than a fully worked implementation. To address the concern, we will add an illustrative example in the revised manuscript, including a sample control structure, an unsafe control action, and its translation to a specific message mutation. This will make the proposal more concrete without altering the vision-oriented nature of the work. revision: yes
Referee: [Abstract] Abstract (SAMT proposal paragraph): The paper introduces 'temporally bounded faults' into messages as the core mechanism for simulating interaction failures but provides neither a definition of temporal bounding nor an argument or illustration showing why such faults are more realistic or more effective than component-level mutations at exposing the interaction issues that cause accidents. This absence directly undermines the superiority claim over traditional mutation testing.

Authors: We agree that the absence of a definition for 'temporally bounded faults' and supporting argument leaves the distinction from traditional mutation testing insufficiently clear. In the revision we will add a precise definition (faults that affect message content or timing only within a bounded temporal window) together with a short illustrative comparison showing how this targets inter-module interaction failures more directly than component-level mutations. revision: yes
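
The definition offered in the second response (faults that affect message content or timing only within a bounded temporal window) can be made concrete with a small sketch. It is not the authors' example: the `Message` and `BoundedFault` types, the channel name, and the window values are hypothetical, and only content and omission faults are shown.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Iterator, Optional


@dataclass(frozen=True)
class Message:
    t: float      # send time in seconds on the simulation clock
    channel: str  # illustrative label for a module-to-module link
    value: float  # payload, reduced to one number for brevity


@dataclass(frozen=True)
class BoundedFault:
    """A fault active only on one channel and only for start <= t < end."""
    channel: str
    start: float
    end: float
    mutate: Callable[[Message], Optional[Message]]  # returning None drops the message

    def active(self, msg: Message) -> bool:
        return msg.channel == self.channel and self.start <= msg.t < self.end


def inject(stream: Iterable[Message], fault: BoundedFault) -> Iterator[Message]:
    """Yield the stream unchanged outside the window and mutated inside it."""
    for msg in stream:
        if fault.active(msg):
            mutated = fault.mutate(msg)
            if mutated is not None:
                yield mutated
        else:
            yield msg


# Five messages at t = 0.0, 0.1, ..., 0.4 on a hypothetical channel.
stream = [Message(i / 10, "planning->control", 1.0) for i in range(5)]

drop = BoundedFault("planning->control", 0.2, 0.4, lambda m: None)
zero = BoundedFault("planning->control", 0.2, 0.4, lambda m: replace(m, value=0.0))

print([m.t for m in inject(stream, drop)])      # [0.0, 0.1, 0.4]
print([m.value for m in inject(stream, zero)])  # [1.0, 1.0, 0.0, 0.0, 1.0]
```

Outside the window the mutant behaves exactly like the original system, which is what separates this from a permanent component-level mutation. A timing fault such as a delay would also need the stream to be re-ordered by timestamp, which this sketch omits.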

Circularity Check

0 steps flagged

No circularity; vision paper with no derivations or self-referential steps

The paper is a conceptual vision proposal for SAMT that suggests deriving mutant rules from STPA but supplies no equations, parameter fits, uniqueness theorems, or derivation chains. The abstract and description state the proposal without reducing any claim to its own inputs by construction, self-citation, or renaming. No load-bearing steps match the enumerated circularity patterns; the work is self-contained as an outline of open challenges rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on the domain assumption that STPA can yield realistic fault rules and on the invented concept of SAMT itself; no free parameters or external benchmarks are involved.

axioms (1)

domain assumption: STPA-derived rules will produce mutants that represent genuine hazards for ADS interaction failures
Invoked in the proposal of mutant generation rules directly from safety engineering frameworks.

invented entities (1)

Safety-Aware Mutation Testing (SAMT): no independent evidence
purpose: To provide a falsifiable measure of test adequacy focused on component interactions in ADS
New paradigm introduced to address limitations of traditional mutation testing and coverage metrics.


[1] [1]

Finding critical scenarios for automated driving systems: A systematic mapping study,

X. Zhang, J. Tao, K. Tan, M. T ¨orngren, J. M. G. S´anchez, M. R. Ramli, X. Tao, M. Gyllenhammar, F. Wotawa, N. Mohan, M. Nica, and H. Felbinger, “Finding critical scenarios for automated driving systems: A systematic mapping study,”IEEE Transactions on Software Engi- neering, vol. 49, no. 3, pp. 991–1026, 2023

2023

[2] [2]

A Survey on Scenario-Based Testing for Au- tomated Driving Systems in High-Fidelity Simulation,

Z. Zhong, Y . Tang, Y . Zhou, V . d. O. Neves, Y . Liu, and B. Ray, “A Survey on Scenario-Based Testing for Au- tomated Driving Systems in High-Fidelity Simulation,” Dec. 2021

2021

[3] [3]

An Analysis and Survey of the Development of Mutation Testing,

Y . Jia and M. Harman, “An Analysis and Survey of the Development of Mutation Testing,”IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, Sep. 2011

2011

[4] [4]

Mutation Testing Advances: An Anal- ysis and Survey,

M. Papadakis, M. Kintis, J. Zhang, Y . Jia, Y . L. Traon, and M. Harman, “Mutation Testing Advances: An Anal- ysis and Survey,” inAdvances in Computers. Elsevier, 2019, vol. 112, pp. 275–378

2019

[5] [5]

Pit: a practical mutation testing tool for java (demo),

H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ventresque, “Pit: a practical mutation testing tool for java (demo),” inProceedings of the 25th International Symposium on Software Testing and Analysis, ser. ISSTA

[6] [6]

New York, NY , USA: Association for Computing Machinery, 2016, p. 449–452

2016

[7] [7]

A comprehensive empirical and theoretical analysis of batching algorithms for efficient, safe, parallel mutation analysis in rust,

Z. L ´evai, D. Shin, and P. McMinn, “A comprehensive empirical and theoretical analysis of batching algorithms for efficient, safe, parallel mutation analysis in rust,” ACM Trans. Softw. Eng. Methodol., Jan. 2026

2026

[8] [8]

DeepMutation++: A Mutation Testing Framework for Deep Learning Systems,

Q. Hu, L. Ma, X. Xie, B. Yu, Y . Liu, and J. Zhao, “DeepMutation++: A Mutation Testing Framework for Deep Learning Systems,” in2019 34th IEEE/ACM Inter- national Conference on Automated Software Engineering (ASE). San Diego, CA, USA: IEEE, Nov. 2019, pp. 1158–1161

2019

[9] [9]

Deep- Crime: mutation testing of deep learning systems based on real faults,

N. Humbatova, G. Jahangirova, and P. Tonella, “Deep- Crime: mutation testing of deep learning systems based on real faults,” inProceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analy- sis, ser. ISSTA 2021. New York, NY , USA: Association for Computing Machinery, Jul. 2021, pp. 67–78

2021

[10] [10]

Investigations of the software testing cou- pling effect,

A. J. Offutt, “Investigations of the software testing cou- pling effect,”ACM Trans. Softw. Eng. Methodol., vol. 1, no. 1, p. 5–20, Jan. 1992

1992

[11] [11]

Are mutants a valid substitute for real faults in software testing?

R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, and G. Fraser, “Are mutants a valid substitute for real faults in software testing?” inProceedings of the 22nd ACM SIGSOFT International Symposium on Foun- dations of Software Engineering, ser. FSE 2014. New York, NY , USA: Association for Computing Machinery, Nov. 2014, pp. 654–665

2014

[12] [12]

Mutations: How Close are they to Real Faults?

R. Gopinath, C. Jensen, and A. Groce, “Mutations: How Close are they to Real Faults?” in2014 IEEE 25th Inter- national Symposium on Software Reliability Engineering, Nov. 2014, pp. 189–200

2014

[13] [13]

Are mutation scores correlated with real fault detection? a large scale empirical study on the relationship between mutants and real faults,

M. Papadakis, D. Shin, S. Yoo, and D.-H. Bae, “Are mutation scores correlated with real fault detection? a large scale empirical study on the relationship between mutants and real faults,” inProceedings of the 40th International Conference on Software Engineering, ser. ICSE ’18. New York, NY , USA: Association for Computing Machinery, May 2018, pp. 537–548

2018

[14] [14]

Two notions of correctness and their relation to testing,

T. A. Budd and D. Angluin, “Two notions of correctness and their relation to testing,”Acta informatica, vol. 18, no. 1, pp. 31–45, 1982

1982

[15] [15]

Using program slicing to assist in the detection of equivalent mutants,

R. Hierons, M. Harman, and S. Danicic, “Using program slicing to assist in the detection of equivalent mutants,” Software Testing, Verification and Reliability, vol. 9, no. 4, pp. 233–262, 1999

1999

[16] [16]

Trivial Compiler Equivalence: A Large Scale Empirical Study of a Simple, Fast and Effective Equivalent Mutant Detection Technique,

M. Papadakis, Y . Jia, M. Harman, and Y . Le Traon, “Trivial Compiler Equivalence: A Large Scale Empirical Study of a Simple, Fast and Effective Equivalent Mutant Detection Technique,” in2015 IEEE/ACM 37th IEEE In- ternational Conference on Software Engineering, vol. 1, May 2015, pp. 936–946

2015

[17]

Large Language Models for Equivalent Mutant Detection: How Far Are We?

Z. Tian, H. Shu, D. Wang, X. Cao, Y. Kamei, and J. Chen, “Large Language Models for Equivalent Mutant Detection: How Far Are We?” in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2024. New York, NY, USA: Association for Computing Machinery, Sep. 2024, pp. 1733–1745

2024

[18]

Property-based mutation testing

E. Bartocci, L. Mariani, D. Ničković, and D. Yadav, “Property-based mutation testing,” in 2023 IEEE Conference on Software Testing, Verification and Validation (ICST), 2023, pp. 222–233

2023

[19]

End to End Learning for Self-Driving Cars

M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self-driving cars,” 2016. [Online]. Available: https://arxiv.org/abs/1604.07316

2016

[20]

Baidu Apollo EM Motion Planner

H. Fan, F. Zhu, C. Liu, L. Zhang, L. Zhuang, D. Li, W. Zhu, J. Hu, H. Li, and Q. Kong, “Baidu Apollo EM motion planner,” arXiv preprint arXiv:1807.08048, 2018

2018

[21]

Autoware on board: Enabling autonomous vehicles with embedded systems

S. Kato, S. Takeuchi, Y. Ishiguro, Y. Ninomiya, K. Otsuka, and T. Miyata, “Autoware on board: Enabling autonomous vehicles with embedded systems,” in 2018 ACM/IEEE 9th International Conference on Cyber-Physical Systems (ICCPS). IEEE, 2018, pp. 287–296

2018

[22]

Robot Operating System 2: Design, architecture, and uses in the wild

S. Macenski, T. Foote, B. Gerkey, C. Lalancette, and W. Woodall, “Robot Operating System 2: Design, architecture, and uses in the wild,” Science Robotics, vol. 7, no. 66, p. eabm6074, May 2022

2022

[23]

Deeptest: automated testing of deep-neural-network-driven autonomous cars

Y. Tian, K. Pei, S. Jana, and B. Ray, “Deeptest: automated testing of deep-neural-network-driven autonomous cars,” in Proceedings of the 40th International Conference on Software Engineering, ser. ICSE ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 303–314

2018

[24]

Structural test coverage criteria for deep neural networks

Y. Sun, X. Huang, D. Kroening, J. Sharp, M. Hill, and R. Ashmore, “Structural test coverage criteria for deep neural networks,” ACM Trans. Embed. Comput. Syst., vol. 18, no. 5s, Oct. 2019

2019

[25]

S3C: Spatial Semantic Scene Coverage for Autonomous Vehicles

T. Woodlief, F. Toledo, S. Elbaum, and M. B. Dwyer, “S3C: Spatial Semantic Scene Coverage for Autonomous Vehicles,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24. New York, NY, USA: Association for Computing Machinery, Apr. 2024, pp. 1–13

2024

[26]

STARS: A Tool for Measuring Scenario Coverage When Testing Autonomous Robotic Systems

T. Schallau, D. Mäckel, S. Naujokat, and F. Howar, “STARS: A Tool for Measuring Scenario Coverage When Testing Autonomous Robotic Systems,” in Dependable Computing – EDCC 2024 Workshops, B. Sangchoolie, R. Adler, R. Hawkins, P. Schleiss, A. Arteconi, and A. Mancini, Eds. Cham: Springer Nature Switzerland, 2024, pp. 62–70

2024

[27]

Engineering a safer world: systems thinking applied to safety

N. Leveson, Engineering a safer world: systems thinking applied to safety, ser. Engineering systems. Cambridge, Mass: MIT Press, 2011

2011

[28]

CARLA: An Open Urban Driving Simulator

A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An Open Urban Driving Simulator,” in Proceedings of the 1st Annual Conference on Robot Learning. PMLR, Oct. 2017, pp. 1–16

2017

[29]

Search-based software test data generation: a survey

P. McMinn, “Search-based software test data generation: a survey,” Software Testing, Verification and Reliability, vol. 14, no. 2, pp. 105–156, 2004

2004

[30]

Mutation based test case generation via a path selection strategy

M. Papadakis and N. Malevris, “Mutation based test case generation via a path selection strategy,” Information and Software Technology, vol. 54, no. 9, pp. 915–932, Sep. 2012

2012

[31]

Achieving scalable mutation-based generation of whole test suites

G. Fraser and A. Arcuri, “Achieving scalable mutation-based generation of whole test suites,” Empirical Software Engineering, vol. 20, no. 3, pp. 783–812, Jun. 2015

2015

[32]

A Comprehensive Study of Bug-Fix Patterns in Autonomous Driving Systems

Y. Chen, Y. Huai, Y. He, S. Li, C. Hong, Q. A. Chen, and J. Garcia, “A Comprehensive Study of Bug-Fix Patterns in Autonomous Driving Systems,” Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 380–402, Jun. 2025

2025

[33]

Defects4J: a database of existing faults to enable controlled testing studies for Java programs

R. Just, D. Jalali, and M. D. Ernst, “Defects4J: a database of existing faults to enable controlled testing studies for Java programs,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis, ser. ISSTA 2014. New York, NY, USA: Association for Computing Machinery, Jul. 2014, pp. 437–440

2014

[35]

Metallaxis-FL: mutation-based fault localization

M. Papadakis and Y. Le Traon, “Metallaxis-FL: mutation-based fault localization,” Software Testing, Verification and Reliability, vol. 25, no. 5-7, pp. 605–628, 2015

2015

[36]

Mutation-based fault localization for real-world multilingual programs (t)

S. Hong, B. Lee, T. Kwak, Y. Jeon, B. Ko, Y. Kim, and M. Kim, “Mutation-based fault localization for real-world multilingual programs (t),” in 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2015, pp. 464–475

2015

[37]

Mutation-based Fault Localization of Deep Neural Networks

A. Ghanbari, D.-G. Thomas, M. A. Arshad, and H. Rajan, “Mutation-based Fault Localization of Deep Neural Networks,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). Luxembourg, Luxembourg: IEEE, Sep. 2023, pp. 1301–1313

2023

[38]

ROCAS: Root Cause Analysis of Autonomous Driving Accidents via Cyber-Physical Co-mutation

S. Feng, Y. Ye, Q. Shi, Z. Cheng, X. Xu, S. Cheng, H. Choi, and X. Zhang, “ROCAS: Root Cause Analysis of Autonomous Driving Accidents via Cyber-Physical Co-mutation,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. Sacramento, CA, USA: ACM, Oct. 2024, pp. 1620–1632

2024

[39]

ACAV: A Framework for Automatic Causality Analysis in Autonomous Vehicle Accident Recordings

H. Sun, C. M. Poskitt, Y. Sun, J. Sun, and Y. Chen, “ACAV: A Framework for Automatic Causality Analysis in Autonomous Vehicle Accident Recordings,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24. New York, NY, USA: Association for Computing Machinery, Apr. 2024, pp. 1–13

2024

[40]

FIXDRIVE: Automatically Repairing Autonomous Vehicle Driving Behaviour for $0.08 per Violation

Y. Sun, C. M. Poskitt, K. Wang, and J. Sun, “FIXDRIVE: Automatically Repairing Autonomous Vehicle Driving Behaviour for $0.08 per Violation,” in 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), Apr. 2025, pp. 1921–1933

2025

[41]

On Determinism of Game Engines Used for Simulation-Based Autonomous Vehicle Verification

G. Chance, A. Ghobrial, K. McAreavey, S. Lemaignan, T. Pipe, and K. Eder, “On Determinism of Game Engines Used for Simulation-Based Autonomous Vehicle Verification,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 11, pp. 20 538–20 552, Nov. 2022

2022

[42]

Empirically Evaluating Flaky Tests for Autonomous Driving Systems in Simulated Environments

O. Osikowicz, P. McMinn, and D. Shin, “Empirically Evaluating Flaky Tests for Autonomous Driving Systems in Simulated Environments,” in 2025 IEEE/ACM International Flaky Tests Workshop (FTW), Apr. 2025, pp. 13–20

2025