Quality-Driven Selective Mutation for Deep Learning
Pith reviewed 2026-05-08 11:11 UTC · model grok-4.3
The pith
A probabilistic framework ranks deep-learning mutation operators by resistance and realism to cut generated mutants by up to 55.6 percent while preserving quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that mutant quality in deep learning can be quantified probabilistically by combining resistance, defined through statistical killing probabilities, and realism, captured by generalized Jaccard similarity on detectability patterns between mutants and real faults. This dual measure allows selection of high-quality mutation-operator configurations that lower the volume of mutants generated and executed. Empirical results on the CleanML, DeepFD, DeepLocalize, and defect4ML datasets confirm reductions of up to 55.6 percent while holding resistance and realism steady under baseline-aligned thresholds.
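The realism side of this dual measure can be made concrete. Below is a minimal sketch of generalized Jaccard similarity on detectability patterns, assuming each pattern is a vector of per-test detection rates in [0, 1]; the encoding and the numbers are illustrative assumptions, not the paper's data.

```python
def generalized_jaccard(x, y):
    """Generalized Jaccard: sum of element-wise minima over sum of
    element-wise maxima, for nonnegative pattern vectors."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 1.0  # two all-zero patterns are identical

# Hypothetical detectability patterns of one mutant and one real fault
# across four test sets (fraction of runs in which each is detected).
mutant_pattern = [0.9, 0.0, 0.5, 1.0]
fault_pattern  = [0.8, 0.1, 0.5, 0.7]
print(generalized_jaccard(mutant_pattern, fault_pattern))  # ≈ 0.8
```

A score near 1 means the mutant is detected by roughly the same tests, to the same degree, as the paired real fault; such scores rank operator configurations on the realism axis.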
What carries the argument
The probabilistic framework that scores mutation-operator configurations on resistance via killing probabilities and on realism via generalized Jaccard similarity of detectability patterns to enable ranking and filtering of high-quality configurations.
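The resistance half of that score can be sketched similarly. This assumes killing is decided per training run by a simple accuracy-drop margin, which is a simplification of the paper's statistical test; the run counts, margin, and accuracy values are invented for illustration.

```python
def killing_probability(orig_accs, mutant_accs, margin=0.02):
    """Fraction of stochastic training runs in which the mutant is
    detected, i.e. its accuracy trails the original by more than margin."""
    kills = sum(1 for o, m in zip(orig_accs, mutant_accs) if o - m > margin)
    return kills / len(orig_accs)

# Five repeated trainings of the original model and of one mutant.
orig = [0.91, 0.90, 0.92, 0.91, 0.90]
mut  = [0.85, 0.91, 0.84, 0.86, 0.90]  # detected in 3 of 5 runs
p_kill = killing_probability(orig, mut)
print(p_kill, "resistance:", 1 - p_kill)
```

A mutant killed in few runs has low killing probability and therefore high resistance, which is the hard-to-kill property the framework rewards.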
If this is right
- High-quality operator configurations reduce the number of mutants while retaining their effectiveness for guiding test improvement.
- The same configurations maintain comparable realism when mutants act as substitutes for real DL faults.
- Operator selection works without requiring a commitment to one particular role for the mutants.
- The preservation of resistance and realism levels holds when validated on held-out fault data.
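The first two points amount to a thresholding step over scored configurations. A sketch of the dual-objective selection, with made-up configuration names, scores, and threshold values standing in for the paper's baseline-aligned thresholds:

```python
configs = [
    # (operator configuration, resistance score, realism score) — invented
    ("change_activation/relu->tanh", 0.62, 0.71),
    ("remove_layer/dense",           0.15, 0.80),
    ("change_lr/x10",                0.70, 0.22),
    ("shuffle_labels/10pct",         0.55, 0.64),
]

RESISTANCE_MIN, REALISM_MIN = 0.5, 0.6  # assumed thresholds

selected = [name for name, res, real in configs
            if res >= RESISTANCE_MIN and real >= REALISM_MIN]
reduction = 1 - len(selected) / len(configs)
print(selected)                             # two configurations pass both
print(f"mutants avoided: {reduction:.0%}")  # 50%
```

Configurations strong on only one axis (the second and third rows) are dropped, which is what lets selection cut mutant volume without sacrificing either role.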
Where Pith is reading between the lines
- If detectability patterns prove consistent across additional DL applications, the similarity metric could guide operator choice in new projects even with sparse fault data.
- The reduction in mutant count could make full mutation testing practical for larger DL models where complete generation was previously too expensive.
- The framework might be combined with other DL debugging methods that already use mutants to further lower overall testing cost.
- Collecting more real DL fault data over time could refine the realism measure and improve selection accuracy.
Load-bearing premise
The assumption that generalized Jaccard similarity on detectability patterns serves as a valid proxy for how well mutants substitute for real deep learning faults, and that the four chosen datasets represent the faults encountered in practice.
What would settle it
Apply both the selected high-quality operators and the full operator set to a new deep learning system, then compare the two against independently collected real faults from that system. If the selected mutants showed markedly lower statistical killing resistance or markedly lower Jaccard similarity to those faults, the claim of preserved quality would be falsified.
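Stated as an executable check, with every score invented: the claim survives only if the selected subset's mean quality on the new system stays within a tolerance of the full set's. The margin value is an assumption standing in for "markedly lower."

```python
def quality_preserved(full_scores, selected_scores, margin=0.05):
    """True unless the selected subset's mean score on held-out real
    faults falls more than `margin` below the full operator set's."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(full_scores) - mean(selected_scores) <= margin

# Hypothetical Jaccard (or resistance) scores on the new system.
full_set = [0.55, 0.60, 0.40, 0.70, 0.65]
selected = [0.58, 0.66, 0.62]
print(quality_preserved(full_set, selected))  # True: claim not falsified
```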
Original abstract
Mutants support testing and debugging in two roles: (i) as test goals and (ii) as substitutes for real faults. Hard-to-kill mutants provide better guidance for test improvement, while realism is essential when mutants are used to simulate real bugs. Building on these roles, selective mutation for deep learning (DL) aims to reduce the cost of mutant generation and execution by choosing operator configurations that yield resistant and realistic mutants. However, the DL literature lacks a unified measure that captures both aspects. This study presents a probabilistic framework to quantify mutant quality along two complementary axes: resistance and realism. Resistance adapts the classical notion of hard-to-kill mutants to the DL setting using statistical killing probabilities, while realism is measured via the generalized Jaccard similarity between mutant and real-fault detectability patterns. The framework enables ranking and filtering of low-quality mutation-operator configurations without assuming a specific use case. We empirically evaluate the approach on four datasets of real DL faults. Three datasets (CleanML, DeepFD, and DeepLocalize) are used to estimate and select high-quality operator configurations, and the held-out defect4ML dataset is used for validation. Results show that quality-driven selection reduces the number of generated mutants by up to 55.6% while preserving typical levels of resistance and realism under baseline-aligned selection thresholds. These findings confirm that dual-objective selection can lower cost without compromising the usefulness of mutants for either role.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a probabilistic framework for quality-driven selective mutation in deep learning. It quantifies mutant quality along two axes—resistance (via statistical killing probabilities adapted to DL) and realism (via generalized Jaccard similarity on detectability patterns between mutants and real faults)—to rank and filter mutation-operator configurations. Using three datasets (CleanML, DeepFD, DeepLocalize) for estimation and selection, with defect4ML held out for validation, the authors report that the approach reduces generated mutants by up to 55.6% while preserving typical levels of resistance and realism under baseline-aligned thresholds, supporting both test-guidance and fault-substitution roles.
Significance. If the central empirical result holds, the framework could meaningfully lower the cost of mutation testing for DL systems, making it more practical to use mutants for improving test suites and simulating real faults. The held-out validation across four real DL fault datasets and the dual-objective (resistance + realism) selection are strengths that distinguish this from prior single-axis selective mutation work.
major comments (2)
- §3.2 (realism quantification): The central claim that quality-driven selection preserves mutant usefulness as substitutes for real DL faults rests on generalized Jaccard similarity of detectability patterns being a valid proxy for realism. No external validation is provided showing correlation between high Jaccard scores and actual behavioral equivalence (e.g., comparable accuracy drops or prediction shifts induced by the mutant versus the corresponding real fault). This assumption is load-bearing for interpreting the 55.6% reduction as maintaining both roles.
- Results section (defect4ML validation): The held-out check confirms that selected configurations retain similar Jaccard scores to the training datasets, but this only verifies consistency of the proxy, not that the proxy tracks real-fault likeness. Without an additional behavioral-impact comparison (e.g., against model-output divergence metrics), the preservation of 'typical levels of realism' does not fully support the fault-substitution usefulness claim.
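The missing validation both major comments describe is essentially a correlation study: pair each mutant's Jaccard realism score with how closely its behavioral impact tracks its real fault's. A sketch with entirely invented numbers, using the negated accuracy-drop gap so that a positive correlation supports the proxy:

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

jaccard_scores = [0.82, 0.61, 0.45, 0.90, 0.30]
# -(|mutant accuracy drop - real-fault accuracy drop|) per pair, invented.
impact_gap     = [-0.02, -0.08, -0.15, -0.01, -0.20]
print(f"r = {pearson(jaccard_scores, impact_gap):.2f}")
```

A strong positive r on real data would give the behavioral grounding the referee asks for; a weak one would undercut the fault-substitution claim.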
minor comments (2)
- Abstract: The summary states the 55.6% reduction and framework but omits any reference to the key probabilistic equations or the exact definition of baseline-aligned thresholds, reducing accessibility for readers assessing the technical contribution from the abstract alone.
- Notation and figures: Ensure all symbols for killing probabilities and Jaccard indices are defined consistently; some figure captions could more explicitly link plotted quantities back to the resistance and realism axes.
Simulated Author's Rebuttal
Thank you for the constructive review and the recommendation for major revision. We appreciate the focus on the validity of the realism proxy and its implications for the fault-substitution claim. Below we respond point by point to the major comments, offering clarifications grounded in the manuscript while acknowledging where additional discussion is warranted.
Point-by-point responses
Referee: §3.2 (realism quantification): The central claim that quality-driven selection preserves mutant usefulness as substitutes for real DL faults rests on generalized Jaccard similarity of detectability patterns being a valid proxy for realism. No external validation is provided showing correlation between high Jaccard scores and actual behavioral equivalence (e.g., comparable accuracy drops or prediction shifts induced by the mutant versus the corresponding real fault). This assumption is load-bearing for interpreting the 55.6% reduction as maintaining both roles.
Authors: We thank the referee for this observation. The generalized Jaccard similarity is defined on detectability patterns precisely because these patterns encode observable test outcomes for both mutants and real faults, providing a direct, dataset-agnostic measure of how similarly they would guide test improvement or serve as fault substitutes. This choice aligns with the two roles articulated in the introduction and avoids requiring model-specific internal metrics that may not generalize across the four DL fault datasets. We acknowledge that an explicit correlation analysis with behavioral metrics such as accuracy drops or prediction shifts would offer complementary evidence. In the revised version we will expand the discussion in §3.2 to articulate the rationale for the proxy, its connection to behavioral equivalence via test exposure, and the limitations of the current validation.
Revision: partial
Referee: Results section (defect4ML validation): The held-out check confirms that selected configurations retain similar Jaccard scores to the training datasets, but this only verifies consistency of the proxy, not that the proxy tracks real-fault likeness. Without an additional behavioral-impact comparison (e.g., against model-output divergence metrics), the preservation of 'typical levels of realism' does not fully support the fault-substitution usefulness claim.
Authors: We agree that the held-out results on defect4ML primarily establish consistency of the selected operator configurations under the proposed dual-axis metric rather than an independent verification of the proxy itself. The reported preservation of typical Jaccard levels therefore supports generalizability of the quality-driven selection procedure, but the fault-substitution interpretation remains tied to the realism axis as defined. We will revise the Results section to make this scoping explicit, clarify that claims about usefulness for fault substitution are made with respect to the Jaccard-based realism measure, and add a brief forward-looking statement on the value of future behavioral-impact studies.
Revision: partial
Circularity Check
No significant circularity: held-out validation measures reduction independently after selection on separate datasets
Full rationale
The paper defines resistance via statistical killing probabilities and realism via generalized Jaccard similarity on detectability patterns, then uses three datasets (CleanML, DeepFD, DeepLocalize) to select high-quality operator configurations and applies them to the held-out defect4ML dataset for validation. The reported up to 55.6% reduction in generated mutants is a direct count of configurations retained under the selection thresholds, not a fitted parameter or prediction derived from the same inputs. Preservation of resistance and realism is checked as an independent measurement on the validation set rather than being tautological by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the derivation; the framework remains self-contained with external held-out evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- baseline-aligned selection thresholds
axioms (1)
- Domain assumption: the four datasets (CleanML, DeepFD, DeepLocalize, defect4ML) collectively represent typical DL faults.