Recognition: unknown
Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts
Pith reviewed 2026-05-08 08:00 UTC · model grok-4.3
The pith
Test selection metrics for deep learning show inconsistent performance depending on objectives, OOD shifts, and data modalities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenarios (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.
What carries the argument
The multi-objective multi-scenario empirical benchmark that varies testing goals, distribution shifts, modalities, and models to compare the 15 metrics.
If this is right
- Metric effectiveness for fault detection does not necessarily carry over to performance estimation or retraining guidance.
- Results obtained on image data may not transfer to text or Android package modalities.
- Metrics suited to adversarial or corrupted shifts may behave differently under natural or label shifts.
- Practitioners gain evidence to select metrics matched to their particular objective and expected shifts.
Where Pith is reading between the lines
- The benchmark could be extended with newer metrics or larger-scale models to test whether current patterns persist.
- Safety-critical testing pipelines might benefit from objective-specific selection logic rather than a single default metric.
- The study suggests opportunities to design hybrid metrics that adapt when multiple objectives or shift types are present simultaneously.
Load-bearing premise
The chosen 15 metrics, three objectives, five OOD types, three modalities, and 13 models are sufficiently representative that the resulting rankings and statistical findings generalize to other DL systems and real-world deployment contexts.
What would settle it
A replication study that uses a fresh set of models or additional real-world OOD cases and finds substantially different performance orderings among the metrics for the same objectives.
Figures
read the original abstract
Deep learning (DL)-based systems can exhibit unexpected behavior when exposed to out-of-distribution (OOD) scenarios, posing serious risks in safety-critical domains such as malware detection and autonomous driving. This underscores the importance of thoroughly testing such systems before deployment. To this end, researchers have proposed a wide range of test selection metrics designed to effectively select inputs. However, prior evaluations of metrics reveal three key limitations: (1) narrow testing objectives, for example, many studies assess metrics only for fault detection, leaving their effectiveness for performance estimation unclear; (2) limited coverage of OOD scenarios, with natural and label shifts are rarely considered; (3) Biased dataset selection, where most work focuses on image data while other modalities remain underexplored. Consequently, a unified benchmark that examines how these metrics perform under multiple testing objectives, diverse OOD scenarios, and different data modalities is still lacking. This leaves practitioners uncertain about which test selection metrics are most suitable for their specific objectives and contexts. To address this gap, we conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenarios (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that prior evaluations of test selection metrics for deep learning systems suffer from narrow testing objectives, limited OOD scenario coverage, and biased dataset selection. To address this gap, the authors conduct a large-scale empirical study evaluating 15 existing metrics under three testing objectives (fault detection, performance estimation, retraining guidance), five OOD scenarios (corrupted, adversarial, temporal, natural, label shifts), three data modalities (image, text, Android packages), and 13 DL models, for a total of 1,640 experimental scenarios, accompanied by statistical analysis to guide metric selection.
Significance. If the empirical findings hold and generalize, this work would provide substantial practical value by delivering a unified benchmark that clarifies which test selection metrics are effective under varying objectives and distribution shifts in safety-critical DL applications. The scale of 1,640 scenarios across multiple modalities is a notable strength that could support more informed practitioner decisions than narrower prior studies.
major comments (2)
- [Abstract] Abstract: The description of the experimental design provides no details on the statistical methods used for analysis, multiple-testing corrections applied, or the precise implementation of the 15 metrics. Without this information, the reliability of the claimed 'comprehensive evaluation and statistical analysis' across 1,640 scenarios cannot be assessed.
- [Study design] Study design (as described in the abstract and implied experimental setup): The central claim of offering generalizable insights from a unified benchmark rests on the assumption that the chosen 15 metrics, 13 models, five OOD types, and three modalities are representative. The manuscript does not appear to include sensitivity analyses (e.g., swapping in additional architectures such as Vision Transformers or varying dataset sizes) to verify stability of the metric rankings, which is load-bearing for the generalization of the findings.
minor comments (1)
- [Abstract] Abstract: The sentence fragment 'with natural and label shifts are rarely considered' contains a grammatical error and should be rephrased for readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the scale and potential practical value of our 1,640-scenario benchmark. We address each major comment below with honest revisions where the manuscript can be strengthened without misrepresenting our work.
read point-by-point responses
-
Referee: [Abstract] Abstract: The description of the experimental design provides no details on the statistical methods used for analysis, multiple-testing corrections applied, or the precise implementation of the 15 metrics. Without this information, the reliability of the claimed 'comprehensive evaluation and statistical analysis' across 1,640 scenarios cannot be assessed.
Authors: We agree the abstract's brevity omits these specifics. The full manuscript details the statistical approach in Section 3 (non-parametric tests including Wilcoxon signed-rank with Bonferroni correction for multiple comparisons across objectives and shifts) and metric implementations in Section 4 plus Appendix A (following original papers with our noted adaptations for each modality). In revision we will expand the abstract with a concise clause referencing the statistical framework and directing readers to those sections. revision: yes
-
Referee: [Study design] Study design (as described in the abstract and implied experimental setup): The central claim of offering generalizable insights from a unified benchmark rests on the assumption that the chosen 15 metrics, 13 models, five OOD types, and three modalities are representative. The manuscript does not appear to include sensitivity analyses (e.g., swapping in additional architectures such as Vision Transformers or varying dataset sizes) to verify stability of the metric rankings, which is load-bearing for the generalization of the findings.
Authors: Our 13 models were selected for diversity across modalities and architectures (CNNs, RNNs, and transformers where available in each domain) to reflect common practice; the five OOD types and 15 metrics likewise follow prevalence in prior work. The manuscript does not contain explicit sensitivity analyses such as adding Vision Transformers or dataset-size variations. We will add a dedicated paragraph in the Discussion section acknowledging this limitation, discussing potential impacts on ranking stability based on the observed consistency across the existing 1,640 scenarios, and outlining why the current selection supports the reported insights. revision: partial
Circularity Check
No circularity: purely empirical evaluation with direct experimental outcomes
full rationale
The paper performs an empirical study by selecting 15 metrics, running them on 13 models across 3 objectives, 5 OOD types, and 3 modalities (1640 scenarios total), then reporting statistical rankings. No equations, derivations, fitted parameters, or predictions appear; results are measured outcomes rather than re-statements of inputs. Prior-work citations only motivate the gap and do not load-bear any uniqueness theorem or ansatz. The representativeness concern is a validity issue, not a circular reduction of the claimed findings to the chosen setups by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The 15 chosen metrics adequately represent the space of existing test selection approaches in the literature.
- domain assumption The five listed OOD scenarios and three modalities capture the distribution shifts and data types relevant to safety-critical DL deployment.
Reference graph
Works this paper leans on
-
[1]
Zohreh Aghababaeyan, Manel Abdellatif, Lionel Briand, S Ramesh, and Mojtaba Bagherzadeh. 2023. Black-box testing of deep neural networks through test case diversity.IEEE Transactions on Software Engineering49, 5 (2023), 3182–3204. doi:10.1109/TSE.2023.3243522
-
[2]
Zohreh Aghababaeyan, Manel Abdellatif, Mahboubeh Dadkhah, and Lionel Briand. 2024. Deepgd: A multi-objective black-box test selection approach for deep neural networks.ACM Transactions on Software Engineering and Methodology 33, 6 (2024), 1–29. doi:10.1145/3644388
-
[3]
Kevin Allix, Tegawendé F Bissyandé, Jacques Klein, and Yves Le Traon. 2016. Androzoo: Collecting millions of Android apps for the research community. InProceedings of the 13th international conference on mining software repositories. 468–471. doi:10.1145/2901739.2903508
-
[4]
Kylie Anglin, Qing Liu, and Vivian C Wong. 2024. A primer on the validity typology and threats to validity in education research.Asia Pacific Education Review25, 3 (2024), 557–574. doi:10.1007/s12564-024-09955-4
-
[5]
Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. 2014. Drebin: Effective and explainable detection of android malware in your pocket.. InNdss, Vol. 14. 23–26. doi:10.14722/ndss.2014. 23247
-
[6]
Mohammed Attaoui, Hazem Fahmy, Fabrizio Pastore, and Lionel Briand. 2023. Black-box safety analysis and retraining of DNNs based on feature extraction and clustering.ACM Transactions on Software Engineering and Methodology32, 3 (2023), 1–40. doi:10.1145/3550271
-
[7]
Mohammed Oualid Attaoui, Hazem Fahmy, Fabrizio Pastore, and Lionel Briand. 2024. Supporting safety analysis of image-processing dnns through clustering-based approaches.ACM Transactions on Software Engineering and Methodology33, 5 (2024), 1–48. doi:10.1145/3643671
-
[8]
David Berend, Xiaofei Xie, Lei Ma, Lingjun Zhou, Yang Liu, Chi Xu, and Jianjun Zhao. 2020. Cats are not fish: Deep learning testing calls for out-of-distribution awareness. InProceedings of the 35th IEEE/ACM international conference on automated software engineering. 1041–1052. doi:10.1145/3324884.3416609
-
[9]
Ella Bingham and Heikki Mannila. 2001. Random projection in dimensionality reduction: applications to image and text data. InProceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. 245–250. doi:10.1145/502512.502546
-
[10]
Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to end learning for self-driving cars.arXiv preprint arXiv:1604.07316(2016). https://arxiv.org/abs/1604.07316
work page internal anchor Pith review arXiv 2016
-
[11]
Taejoon Byun, Vaibhav Sharma, Abhishek Vijayakumar, Sanjai Rayadurgam, and Darren Cofer. 2019. Input prioritization for testing neural networks. In2019 IEEE International Conference On Artificial Intelligence Testing (AITest). IEEE, 63–70. doi:10.1109/AITest.2019.000-6
-
[12]
Junjie Chen, Zhuo Wu, Zan Wang, Hanmo You, Lingming Zhang, and Ming Yan. 2020. Practical accuracy estimation for efficient deep neural network testing.ACM Transactions on Software Engineering and Methodology (TOSEM)29, 4 (2020), 1–35. doi:10.1145/3394112
-
[13]
Lingjiao Chen, Matei Zaharia, and James Y Zou. 2022. Estimating and explaining model performance when both covariates and labels shift.Advances in Neural Information Processing Systems35 (2022), 11467–11479. https: //dl.acm.org/doi/10.5555/3600270.3601103
- [14]
-
[15]
Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. 2017. EMNIST: Extending MNIST to hand- written letters.2017 International Joint Conference on Neural Networks (IJCNN)(2017). doi:10.1109/ijcnn.2017.7966217
-
[16]
Demet Demir, Aysu Betin Can, and Elif Surer. 2024. Test selection for deep neural networks using meta-models with uncertainty metrics. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 678–690. doi:10.1145/3650212.3680312
-
[17]
K. P. Devakumar. 2020. IMDB Review Classification LSTM, GRU, CNN, GloVe. https://www.kaggle.com/code/imdevskp/imdb-review-classification-lstm-gru-cnn-glove/
2020
-
[18]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186. doi:10.18653...
-
[19]
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Inkdd, Vol. 96. 226–231. https://dl.acm.org/doi/10.5555/3001460.3001507
-
[20]
Yang Feng, Qingkai Shi, Xinyu Gao, Jun Wan, Chunrong Fang, and Zhenyu Chen. 2020. Deepgini: prioritizing massive tests to enhance the robustness of deep neural networks. InProceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis. 177–188. doi:10.1145/3395363.3397357 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE058...
-
[21]
Xinyu Gao, Yang Feng, Yining Yin, Zixi Liu, Zhenyu Chen, and Baowen Xu. 2022. Adaptive test selection for deep neural networks. InProceedings of the 44th international conference on software engineering. 73–85. doi:10.1145/3510003.3510232
-
[22]
Jacob Gildenblat. 2016. Visualizations for understanding the regressed wheel steering angle for self-driving cars. https://github.com/jacobgil/keras-steering-angle-visualizations
2016
-
[23]
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572(2014). https://doi.org/10.48550/arXiv.1412.6572
work page internal anchor Pith review doi:10.48550/arxiv.1412.6572 2014
-
[24]
Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. 2017. Adversarial examples for malware detection. InComputer Security–ESORICS 2017: 22nd European Symposium on Research in Computer Security, Oslo, Norway, September 11-15, 2017, Proceedings, Part II 22. Springer, 62–79. doi:10.1007/978-3-319- 66399-9_4
-
[26]
Antonio Guerriero, Roberto Pietrantuono, and Stefano Russo. 2024. DeepSample: DNN sampling-based testing for operational accuracy assessment. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–12. doi:10.1109/ICSE43902.2021.00042
-
[27]
Chris Gundling. 2016. Steering angle model: Cg32.https://github.com/udacity/self-driving-car/tree/master/steering- models/community-models/cg23
2016
-
[28]
Yao Hao, Zhiqiu Huang, Hongjing Guo, and Guohua Shen. 2023. Test input selection for deep neural network enhancement based on multiple-objective optimization. In2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 534–545. doi:10.1109/SANER56733.2023.00056
-
[29]
Fabrice Harel-Canada, Lingxiao Wang, Muhammad Ali Gulzar, Quanquan Gu, and Miryung Kim. 2020. Is neuron coverage a meaningful measure for testing deep neural networks?. InProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 851–862. doi:10.1145/3368089.3409754
-
[30]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778. doi:10.1109/CVPR.2016.90
-
[31]
Dan Hendrycks and Kevin Gimpel. 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks.arXiv preprint arXiv:1610.02136(2016). doi:10.48550/arXiv.1610.02136
work page internal anchor Pith review doi:10.48550/arxiv.1610.02136 2016
-
[32]
Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. InProceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 168–177. doi:10.1145/1014052.1014073
-
[33]
Qiang Hu, Yuejun Guo, Maxime Cordy, Xiaofei Xie, Lei Ma, Mike Papadakis, and Yves Le Traon. 2022. An empirical study on data distribution-aware test selection for deep learning enhancement.ACM Transactions on Software Engineering and Methodology (TOSEM)31, 4 (2022), 1–30. doi:10.1145/3511598
-
[34]
Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Lei Ma, Mike Papadakis, and Yves Le Traon. 2024. Test optimization in dnn testing: A survey.ACM Transactions on Software Engineering and Methodology33, 4 (2024), 1–42. doi:10.1145/ 3643678
2024
-
[35]
Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Wei Ma, Mike Papadakis, Lei Ma, and Yves Le Traon. 2025. Assessing the Robustness of Test Selection Methods for Deep Neural Networks.ACM Transactions on Software Engineering and Methodology(2025). doi:10.1145/3715693
- [36]
-
[37]
Jinhan Kim, Robert Feldt, and Shin Yoo. 2019. Guiding deep learning system testing using surprise adequacy. In2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 1039–1049. doi:10.1109/ICSE.2019.00108
-
[38]
Jinhan Kim, Jeongil Ju, Robert Feldt, and Shin Yoo. 2020. Reducing dnn labelling cost using surprise adequacy: An industrial case study for autonomous driving. InProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1466–1476. doi:10.1145/3368089. 3417065
-
[39]
Ronald S King. 2013. Cluster analysis and data mining: An introduction. (2013). https://dl.acm.org/doi/abs/10.5555/ 2823861
2013
-
[40]
Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. 2018. Adversarial examples in the physical world. InArtificial intelligence safety and security. Chapman and Hall/CRC, 99–112. doi:10.1201/9781351251389-8
-
[41]
Yann LeCun. 1998. The MNIST database of handwritten digits.http://yann. lecun. com/exdb/mnist/(1998). doi:10. 24432/C53K8Q
1998
-
[42]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition.Proc. IEEE86, 11 (1998), 2278–2324. doi:10.1109/5.726791 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE058. Publication date: July 2026. Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribut...
-
[43]
Deqiang Li and Qianmu Li. 2020. Adversarial deep ensemble: Evasion attacks and defenses for malware detection. IEEE Transactions on Information Forensics and Security15 (2020), 3886–3900. doi:10.1109/TIFS.2020.3003571
-
[44]
Deqiang Li, Tian Qiu, Shuo Chen, Qianmu Li, and Shouhuai Xu. 2021. Can we leverage predictive uncertainty to detect dataset shift and adversarial examples in android malware detection?. InProceedings of the 37th Annual Computer Security Applications Conference. 596–608. doi:10.1145/3485832.3485916
-
[45]
Yinghua Li, Xueqi Dang, Lei Ma, Jacques Klein, and Tegawendé F Bissyandé. 2024. Prioritizing test cases for deep learning-based video classifiers.Empirical Software Engineering29, 5 (2024), 111. doi:10.1007/s10664-024-10520-1
-
[46]
Yinghua Li, Xueqi Dang, Lei Ma, Jacques Klein, Yves Le Traon, and Tegawende F Bissyande. 2024. Test input prioritization for 3d point clouds.ACM Transactions on Software Engineering and Methodology33, 5 (2024), 1–44. doi:10.1145/3643676
-
[47]
Zenan Li, Xiaoxing Ma, Chang Xu, Chun Cao, Jingwei Xu, and Jian Lü. 2019. Boosting operational dnn testing efficiency through conditioning. InProceedings of the 2019 27th ACM Joint meeting on European software engineering conference and symposium on the foundations of software engineering. 499–509. doi:10.1145/3338906.3338930
-
[48]
Jingjing Liang, Sebastian Elbaum, and Gregg Rothermel. 2018. Redefining prioritization: continuous prioritization for continuous integration. InProceedings of the 40th International Conference on Software Engineering. 688–698. doi:10.1145/3180155.3180213
-
[49]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692 (2019). https://doi.org/10.48550/arXiv.1907.11692
work page internal anchor Pith review doi:10.48550/arxiv.1907.11692 2019
-
[50]
Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, et al
-
[51]
InProceedings of the 33rd ACM/IEEE international conference on automated software engineering
Deepgauge: Multi-granularity testing criteria for deep learning systems. InProceedings of the 33rd ACM/IEEE international conference on automated software engineering. 120–131. doi:10.1145/3238147.3238202
-
[52]
Wei Ma, Mike Papadakis, Anestis Tsakmalis, Maxime Cordy, and Yves Le Traon. 2021. Test selection for deep learning systems.ACM Transactions on Software Engineering and Methodology (TOSEM)30, 2 (2021), 1–22. doi:10.1145/3417330
-
[53]
Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. InProceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 142–150. https://dl.acm.org/doi/10.5555/2002472.2002491
-
[54]
James MacQueen. 1967. Some methods for classification and analysis of multivariate observations. InProceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, Vol. 5. University of California press, 281–298. https://api.semanticscholar.org/CorpusID:6278891
1967
-
[55]
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083(2017). https://doi.org/10.48550/arXiv. 1706.06083
work page internal anchor Pith review doi:10.48550/arxiv 2017
-
[56]
Leland McInnes, John Healy, Steve Astels, et al. 2017. hdbscan: Hierarchical density based clustering.J. Open Source Softw.2, 11 (2017), 205. doi:10.21105/joss.00205
-
[57]
Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426(2018). doi:10.21105/joss.00861
-
[58]
Niall McLaughlin, Jesus Martinez del Rincon, BooJoong Kang, Suleiman Yerima, Paul Miller, Sakir Sezer, Yeganeh Safaei, Erik Trickel, Ziming Zhao, Adam Doupé, et al. 2017. Deep android malware detection. InProceedings of the seventh ACM on conference on data and application security and privacy. 301–308. doi:10.1145/3029806.3029823
-
[59]
Ecker, Matthias Bethge, and Wieland Brendel
Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S. Ecker, Matthias Bethge, and Wieland Brendel. 2019. Benchmarking Robustness in Object Detection: Autonomous Driving when Winter is Coming.arXiv preprint arXiv:1907.07484(2019). https://doi.org/10.48550/arXiv.1907.07484
-
[60]
Jose G Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V Chawla, and Francisco Herrera. 2012. A unifying view on dataset shift in classification.Pattern recognition45, 1 (2012), 521–530. doi:10.1016/j.patcog.2011.06.019
-
[61]
Vasilii Mosin, Miroslaw Staron, Darko Durisic, Francisco Gomes de Oliveira Neto, Sushant Kumar Pandey, and Ashok Chaitanya Koppisetty. 2022. Comparing input prioritization techniques for testing deep learning algorithms. In 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 76–83. doi:10.1109/ SEAA56994.2022.00020
-
[62]
Davoud Moulavi, Pablo A Jaskowiak, Ricardo JGB Campello, Arthur Zimek, and Jörg Sander. 2014. Density-based clustering validation. InProceedings of the 2014 SIAM international conference on data mining. SIAM, 839–847. doi:10. 1137/1.9781611973440.96
2014
-
[63]
Norman Mu and Justin Gilmer. 2019. Mnist-c: A robustness benchmark for computer vision.arXiv preprint arXiv:1906.02337(2019). https://doi.org/10.48550/arXiv.1906.02337
-
[64]
Apoorv Nandan. 2020. Text classification with Transformer.https://keras.io/examples/nlp/text_classification_with_ transformer/. Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE058. Publication date: July 2026. FSE058:24 Jingyu Zhang, Fan Wang, Jacky Keung, Yihan Liao, Yan Xiao, and Lei Ma
2020
-
[65]
Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points in space.The London, Edinburgh, and Dublin philosophical magazine and journal of science2, 11 (1901), 559–572. doi:10.1080/14786440109462720
-
[66]
Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. Deepxplore: Automated whitebox testing of deep learning systems. Inproceedings of the 26th Symposium on Operating Systems Principles. 1–18. doi:10.1145/3361566
-
[67]
Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. InProceedings of the 57th annual meeting of the association for computational linguistics. 1085–1097. https://api.semanticscholar.org/CorpusID:196202909
2019
-
[68]
Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.Journal of computational and applied mathematics20 (1987), 53–65. doi:10.1016/0377-0427(87)90125-7
-
[69]
Alexandru Constantin Serban, Erik Poll, and Joost Visser. 2018. Adversarial examples-a complete characterisation of the phenomenon.arXiv preprint arXiv:1810.01185(2018). doi:10.48550/arXiv.1810.01185
-
[70]
Weijun Shen, Yanhui Li, Lin Chen, Yuanlei Han, Yuming Zhou, and Baowen Xu. 2020. Multiple-boundary clustering and prioritization to promote neural network retraining. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 410–422. doi:10.1145/3324884.3416621
-
[71]
Ying Shi, Beibei Yin, Zheng Zheng, and Tiancheng Li. 2021. An empirical study on test case prioritization metrics for deep neural networks. In2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS). IEEE, 157–166. doi:10.1109/QRS54544.2021.00027
-
[72]
Alex Staravoitau. 2016. Behavioral cloning: end-to-end learning for self-driving cars. https://github.com/navoshta/behavioral-cloning
2016
-
[73]
Sully-Chen. 2016. Autopilot-Tensorflow.https://github.com/SullyChen/Autopilot-TensorFlow
2016
-
[74]
Weifeng Sun, Meng Yan, Zhongxin Liu, and David Lo. 2023. Robust test selection for deep neural networks.IEEE Transactions on Software Engineering49, 12 (2023), 5250–5278. doi:10.1109/TSE.2023.3330982
-
[75]
Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E Hassan, and Kenichi Matsumoto. 2016. An empirical comparison of model validation techniques for defect prediction models.IEEE Transactions on Software Engineering43, 1 (2016), 1–18. doi:10.1109/TSE.2016.2584050
-
[76]
TextAttack. [n. d.]. Transforms an input by replacing any word with ’banana’.https://textattack.readthedocs.io/en/latest/
-
[77]
Udacity. 2016. Using Deep Learning to Predict Steering Angles. https://medium.com/udacity/challenge-2-using-deep- learning-to-predict-steering-angles-f42004a36ff3
2016
-
[78]
Dan Wang and Yi Shang. 2014. A new active labeling method for deep learning. In2014 International joint conference on neural networks (IJCNN). IEEE, 112–119. doi:10.1109/IJCNN.2014.6889457
-
[79]
Zhiyu Wang, Sihan Xu, Lingling Fan, Xiangrui Cai, Linyu Li, and Zheli Liu. 2024. Can coverage criteria guide failure discovery for image classifiers? an empirical study.ACM Transactions on Software Engineering and Methodology33, 7 (2024), 1–28. doi:10.1145/3672446
-
[80]
Michael Weiss and Paolo Tonella. 2022. Simple techniques work surprisingly well for neural network test prioritization and active learning (replicability study). InProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 139–150. doi:10.1145/3533767.3534375
-
[81]
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? Advances in neural information processing systems27 (2014). https://api.semanticscholar.org/CorpusID:362467
2014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.