Quality-Driven Selective Mutation for Deep Learning
Pith reviewed 2026-05-08 11:11 UTC · model grok-4.3
The pith
A probabilistic framework ranks deep-learning mutation operators by resistance and realism to cut generated mutants by up to 55.6 percent while preserving quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that mutant quality in deep learning can be quantified probabilistically by combining resistance, defined through statistical killing probabilities, and realism, captured by generalized Jaccard similarity on detectability patterns between mutants and real faults. This dual measure allows selection of high-quality mutation-operator configurations that lower the volume of mutants generated and executed. Empirical results on the CleanML, DeepFD, DeepLocalize, and defect4ML datasets confirm reductions of up to 55.6 percent while holding resistance and realism steady under baseline-aligned thresholds.
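The realism side of this dual measure can be made concrete. Below is a minimal sketch of generalized Jaccard similarity on detectability patterns, assuming each pattern is a vector of per-test detection rates in [0, 1]; the encoding and the numbers are illustrative assumptions, not the paper's data.

```python
def generalized_jaccard(x, y):
    """Generalized Jaccard: sum of element-wise minima over sum of
    element-wise maxima, for nonnegative pattern vectors."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 1.0  # two all-zero patterns are identical

# Hypothetical detectability patterns of one mutant and one real fault
# across four test sets (fraction of runs in which each is detected).
mutant_pattern = [0.9, 0.0, 0.5, 1.0]
fault_pattern  = [0.8, 0.1, 0.5, 0.7]
print(generalized_jaccard(mutant_pattern, fault_pattern))  # ≈ 0.8
```

A score near 1 means the mutant is detected by roughly the same tests, to the same degree, as the paired real fault; such scores rank operator configurations on the realism axis.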
What carries the argument
The probabilistic framework that scores mutation-operator configurations on resistance via killing probabilities and on realism via generalized Jaccard similarity of detectability patterns to enable ranking and filtering of high-quality configurations.
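The resistance half of that score can be sketched similarly. This assumes killing is decided per training run by a simple accuracy-drop margin, which is a simplification of the paper's statistical test; the run counts, margin, and accuracy values are invented for illustration.

```python
def killing_probability(orig_accs, mutant_accs, margin=0.02):
    """Fraction of stochastic training runs in which the mutant is
    detected, i.e. its accuracy trails the original by more than margin."""
    kills = sum(1 for o, m in zip(orig_accs, mutant_accs) if o - m > margin)
    return kills / len(orig_accs)

# Five repeated trainings of the original model and of one mutant.
orig = [0.91, 0.90, 0.92, 0.91, 0.90]
mut  = [0.85, 0.91, 0.84, 0.86, 0.90]  # detected in 3 of 5 runs
p_kill = killing_probability(orig, mut)
print(p_kill, "resistance:", 1 - p_kill)
```

A mutant killed in few runs has low killing probability and therefore high resistance, which is the hard-to-kill property the framework rewards.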
If this is right
- High-quality operator configurations reduce the number of mutants while retaining their effectiveness for guiding test improvement.
- The same configurations maintain comparable realism when mutants act as substitutes for real DL faults.
- Operator selection works without requiring a commitment to one particular role for the mutants.
- The preservation of resistance and realism levels holds when validated on held-out fault data.
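The first two points amount to a thresholding step over scored configurations. A sketch of the dual-objective selection, with made-up configuration names, scores, and threshold values standing in for the paper's baseline-aligned thresholds:

```python
configs = [
    # (operator configuration, resistance score, realism score) — invented
    ("change_activation/relu->tanh", 0.62, 0.71),
    ("remove_layer/dense",           0.15, 0.80),
    ("change_lr/x10",                0.70, 0.22),
    ("shuffle_labels/10pct",         0.55, 0.64),
]

RESISTANCE_MIN, REALISM_MIN = 0.5, 0.6  # assumed thresholds

selected = [name for name, res, real in configs
            if res >= RESISTANCE_MIN and real >= REALISM_MIN]
reduction = 1 - len(selected) / len(configs)
print(selected)                             # two configurations pass both
print(f"mutants avoided: {reduction:.0%}")  # 50%
```

Configurations strong on only one axis (the second and third rows) are dropped, which is what lets selection cut mutant volume without sacrificing either role.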
Where Pith is reading between the lines
- If detectability patterns prove consistent across additional DL applications, the similarity metric could guide operator choice in new projects even with sparse fault data.
- The reduction in mutant count could make full mutation testing practical for larger DL models where complete generation was previously too expensive.
- The framework might be combined with other DL debugging methods that already use mutants to further lower overall testing cost.
- Collecting more real DL fault data over time could refine the realism measure and improve selection accuracy.
Load-bearing premise
The assumption that generalized Jaccard similarity on detectability patterns serves as a valid proxy for how well mutants substitute for real deep learning faults, and that the four chosen datasets represent the faults encountered in practice.
What would settle it
Apply both the selected high-quality operators and the full operator set to a new deep learning system, then compare the two against independently collected real faults from that system. If the selected mutants showed markedly lower statistical killing resistance or markedly lower Jaccard similarity to those faults, the claim of preserved quality would be falsified.
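Stated as an executable check, with every score invented: the claim survives only if the selected subset's mean quality on the new system stays within a tolerance of the full set's. The margin value is an assumption standing in for "markedly lower."

```python
def quality_preserved(full_scores, selected_scores, margin=0.05):
    """True unless the selected subset's mean score on held-out real
    faults falls more than `margin` below the full operator set's."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(full_scores) - mean(selected_scores) <= margin

# Hypothetical Jaccard (or resistance) scores on the new system.
full_set = [0.55, 0.60, 0.40, 0.70, 0.65]
selected = [0.58, 0.66, 0.62]
print(quality_preserved(full_set, selected))  # True: claim not falsified
```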
Original abstract
Mutants support testing and debugging in two roles: (i) as test goals and (ii) as substitutes for real faults. Hard-to-kill mutants provide better guidance for test improvement, while realism is essential when mutants are used to simulate real bugs. Building on these roles, selective mutation for deep learning (DL) aims to reduce the cost of mutant generation and execution by choosing operator configurations that yield resistant and realistic mutants. However, the DL literature lacks a unified measure that captures both aspects. This study presents a probabilistic framework to quantify mutant quality along two complementary axes: resistance and realism. Resistance adapts the classical notion of hard-to-kill mutants to the DL setting using statistical killing probabilities, while realism is measured via the generalized Jaccard similarity between mutant and real-fault detectability patterns. The framework enables ranking and filtering of low-quality mutation-operator configurations without assuming a specific use case. We empirically evaluate the approach on four datasets of real DL faults. Three datasets (CleanML, DeepFD, and DeepLocalize) are used to estimate and select high-quality operator configurations, and the held-out defect4ML dataset is used for validation. Results show that quality-driven selection reduces the number of generated mutants by up to 55.6% while preserving typical levels of resistance and realism under baseline-aligned selection thresholds. These findings confirm that dual-objective selection can lower cost without compromising the usefulness of mutants for either role.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a probabilistic framework for quality-driven selective mutation in deep learning. It quantifies mutant quality along two axes—resistance (via statistical killing probabilities adapted to DL) and realism (via generalized Jaccard similarity on detectability patterns between mutants and real faults)—to rank and filter mutation-operator configurations. Using three datasets (CleanML, DeepFD, DeepLocalize) for estimation and selection, with defect4ML held out for validation, the authors report that the approach reduces generated mutants by up to 55.6% while preserving typical levels of resistance and realism under baseline-aligned thresholds, supporting both test-guidance and fault-substitution roles.
Significance. If the central empirical result holds, the framework could meaningfully lower the cost of mutation testing for DL systems, making it more practical to use mutants for improving test suites and simulating real faults. The held-out validation across four real DL fault datasets and the dual-objective (resistance + realism) selection are strengths that distinguish this from prior single-axis selective mutation work.
major comments (2)
- §3.2 (realism quantification): The central claim that quality-driven selection preserves mutant usefulness as substitutes for real DL faults rests on generalized Jaccard similarity of detectability patterns being a valid proxy for realism. No external validation is provided showing correlation between high Jaccard scores and actual behavioral equivalence (e.g., comparable accuracy drops or prediction shifts induced by the mutant versus the corresponding real fault). This assumption is load-bearing for interpreting the 55.6% reduction as maintaining both roles.
- Results section (defect4ML validation): The held-out check confirms that selected configurations retain similar Jaccard scores to the training datasets, but this only verifies consistency of the proxy, not that the proxy tracks real-fault likeness. Without an additional behavioral-impact comparison (e.g., against model-output divergence metrics), the preservation of 'typical levels of realism' does not fully support the fault-substitution usefulness claim.
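The missing validation both major comments describe is essentially a correlation study: pair each mutant's Jaccard realism score with how closely its behavioral impact tracks its real fault's. A sketch with entirely invented numbers, using the negated accuracy-drop gap so that a positive correlation supports the proxy:

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

jaccard_scores = [0.82, 0.61, 0.45, 0.90, 0.30]
# -(|mutant accuracy drop - real-fault accuracy drop|) per pair, invented.
impact_gap     = [-0.02, -0.08, -0.15, -0.01, -0.20]
print(f"r = {pearson(jaccard_scores, impact_gap):.2f}")
```

A strong positive r on real data would give the behavioral grounding the referee asks for; a weak one would undercut the fault-substitution claim.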
minor comments (2)
- Abstract: The summary states the 55.6% reduction and framework but omits any reference to the key probabilistic equations or the exact definition of baseline-aligned thresholds, reducing accessibility for readers assessing the technical contribution from the abstract alone.
- Notation and figures: Ensure all symbols for killing probabilities and Jaccard indices are defined consistently; some figure captions could more explicitly link plotted quantities back to the resistance and realism axes.
Simulated Author's Rebuttal
Thank you for the constructive review and the recommendation for major revision. We appreciate the focus on the validity of the realism proxy and its implications for the fault-substitution claim. Below we respond point by point to the major comments, offering clarifications grounded in the manuscript while acknowledging where additional discussion is warranted.
Point-by-point responses
Referee: §3.2 (realism quantification): The central claim that quality-driven selection preserves mutant usefulness as substitutes for real DL faults rests on generalized Jaccard similarity of detectability patterns being a valid proxy for realism. No external validation is provided showing correlation between high Jaccard scores and actual behavioral equivalence (e.g., comparable accuracy drops or prediction shifts induced by the mutant versus the corresponding real fault). This assumption is load-bearing for interpreting the 55.6% reduction as maintaining both roles.
Authors: We thank the referee for this observation. The generalized Jaccard similarity is defined on detectability patterns precisely because these patterns encode observable test outcomes for both mutants and real faults, providing a direct, dataset-agnostic measure of how similarly they would guide test improvement or serve as fault substitutes. This choice aligns with the two roles articulated in the introduction and avoids requiring model-specific internal metrics that may not generalize across the four DL fault datasets. We acknowledge that an explicit correlation analysis with behavioral metrics such as accuracy drops or prediction shifts would offer complementary evidence. In the revised version we will expand the discussion in §3.2 to articulate the rationale for the proxy, its connection to behavioral equivalence via test exposure, and the limitations of the current validation.
Revision: partial
Referee: Results section (defect4ML validation): The held-out check confirms that selected configurations retain similar Jaccard scores to the training datasets, but this only verifies consistency of the proxy, not that the proxy tracks real-fault likeness. Without an additional behavioral-impact comparison (e.g., against model-output divergence metrics), the preservation of 'typical levels of realism' does not fully support the fault-substitution usefulness claim.
Authors: We agree that the held-out results on defect4ML primarily establish consistency of the selected operator configurations under the proposed dual-axis metric rather than an independent verification of the proxy itself. The reported preservation of typical Jaccard levels therefore supports generalizability of the quality-driven selection procedure, but the fault-substitution interpretation remains tied to the realism axis as defined. We will revise the Results section to make this scoping explicit, clarify that claims about usefulness for fault substitution are made with respect to the Jaccard-based realism measure, and add a brief forward-looking statement on the value of future behavioral-impact studies.
Revision: partial
Circularity Check
No significant circularity: held-out validation measures reduction independently after selection on separate datasets
Full rationale
The paper defines resistance via statistical killing probabilities and realism via generalized Jaccard similarity on detectability patterns, then uses three datasets (CleanML, DeepFD, DeepLocalize) to select high-quality operator configurations and applies them to the held-out defect4ML dataset for validation. The reported up to 55.6% reduction in generated mutants is a direct count of configurations retained under the selection thresholds, not a fitted parameter or prediction derived from the same inputs. Preservation of resistance and realism is checked as an independent measurement on the validation set rather than being tautological by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the derivation; the framework remains self-contained with external held-out evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- baseline-aligned selection thresholds
axioms (1)
- Domain assumption: the four datasets (CleanML, DeepFD, DeepLocalize, defect4ML) collectively represent typical DL faults.