
arxiv: 2503.14852 · v2 · pith:3T63JM3Ynew · submitted 2025-03-19 · 💻 cs.SE

UntrustVul: An Automated Approach for Identifying Untrustworthy Alerts in Vulnerability Detection Models

Pith reviewed 2026-05-23 00:05 UTC · model grok-4.3

classification 💻 cs.SE
keywords vulnerability detection · untrustworthy predictions · machine learning · dependency graphs · code analysis · alert filtering · software security

The pith

UntrustVul identifies untrustworthy vulnerability predictions by flagging lines that neither match historical vulnerability patterns nor influence any lines that do through dependency graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents UntrustVul as a method to automatically detect when machine learning models for finding code vulnerabilities are highlighting lines that have nothing to do with actual vulnerabilities. This matters because such misleading alerts force developers to spend extra time checking them and can lead to incorrect fixes. The approach learns patterns from past vulnerable code and then checks the highlighted lines against those patterns while also following data and control dependencies to see if the line affects any relevant code. A prediction is marked untrustworthy only when the line fails the pattern match and every successor line in the graph also fails it. The design stays conservative so it rarely labels a correct alert as untrustworthy.

Core claim

UntrustVul detects untrustworthy alerts by defining a line as vulnerability-irrelevant when it does not resemble patterns from historical vulnerable lines and all its successors in the data and control dependency graph are likewise vulnerability-irrelevant. This rule lets the method label a model prediction as untrustworthy without needing ground-truth labels on the current example. Evaluation on 115K predictions from four models across BigVul, MegaVul, SARD, and PrimeVul datasets yields AUC scores of 70%-88% and F1-scores of 82%-94%, exceeding prior methods by 6%-59% in AUC and 13%-92% in F1-score.

What carries the argument

The recursive definition of a vulnerability-irrelevant line: mismatch with historical patterns plus irrelevance of all successors in the data and control dependency graph.
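
The recursive rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the dependency graph is assumed to be an adjacency mapping from each line to its data- and control-dependency successors, and `resembles_history` is a hypothetical caller-supplied predicate standing in for the paper's pattern matcher. To stay well-defined on cyclic graphs, the sketch reads the recursion as reachability: a line is irrelevant exactly when neither it nor any line reachable from it resembles a historical vulnerable line.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Mapping


def is_vulnerability_irrelevant(
    line: Hashable,
    successors: Mapping[Hashable, Iterable[Hashable]],
    resembles_history: Callable[[Hashable], bool],
) -> bool:
    """Return True if `line` and everything reachable from it fail the pattern match.

    `successors` and `resembles_history` are hypothetical stand-ins for the
    dependency graph and the historical-pattern matcher described in the text.
    """
    visited = {line}          # visited set guards against cycles in the graph
    queue = deque([line])
    while queue:
        current = queue.popleft()
        if resembles_history(current):
            # A relevant line is reachable, so `line` influences it.
            return False
        for nxt in successors.get(current, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return True


if __name__ == "__main__":
    # Toy graph: 1 -> 2 <-> 3 (cycle), and an isolated cycle 4 <-> 5.
    toy_graph = {1: [2], 2: [3], 3: [2], 4: [5], 5: [4]}
    matches = {3}  # pretend only line 3 resembles a historical vulnerable line
    print(is_vulnerability_irrelevant(1, toy_graph, matches.__contains__))
    print(is_vulnerability_irrelevant(4, toy_graph, matches.__contains__))
```

In the toy graph, line 1 reaches line 3 and is therefore kept as potentially relevant, while lines 4 and 5 only reach each other and are treated as irrelevant. The breadth-first traversal terminates because each line is enqueued at most once.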

If this is right

  • Developers can skip manual inspection of lines marked untrustworthy and focus effort on the remaining alerts.
  • Incorrect patching decisions triggered by irrelevant highlighted lines become less likely.
  • The same filtering step can be applied to any vulnerability detection model that outputs line-level alerts.
  • Performance holds across multiple public datasets and detector architectures without retraining the original models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern-plus-graph test could be tried on other ML tasks that produce line-level explanations, such as defect prediction.
  • If dependency graphs miss indirect influences, some truly relevant lines may still be flagged as irrelevant.

Load-bearing premise

Patterns drawn from historical vulnerable lines together with data and control dependency graphs can separate vulnerability-irrelevant lines from relevant ones without access to ground-truth labels on the current prediction.

What would settle it

A manually labeled set of model predictions where each highlighted line has been checked by experts for actual relevance to the vulnerability; if UntrustVul's untrustworthy flags show low agreement with these labels, the central claim fails.
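
One way to quantify "agreement" in such a check is Cohen's kappa, which corrects raw agreement for chance. The sketch below is only an illustration of that statistic on binary flags; the choice of kappa, the function name, and the encoding (1 = untrustworthy, 0 = trustworthy) are assumptions made here, not something the paper specifies.

```python
from typing import Sequence


def cohens_kappa(tool_flags: Sequence[int], expert_flags: Sequence[int]) -> float:
    """Cohen's kappa for two binary (0/1) label sequences of equal length."""
    if len(tool_flags) != len(expert_flags) or not tool_flags:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(tool_flags)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(a == b for a, b in zip(tool_flags, expert_flags)) / n
    # Chance agreement from each rater's marginal rate of label 1.
    p_tool = sum(tool_flags) / n
    p_expert = sum(expert_flags) / n
    expected = p_tool * p_expert + (1 - p_tool) * (1 - p_expert)
    if expected == 1.0:
        # Degenerate case: both raters use one identical label throughout.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 would indicate that the untrustworthy flags track the expert judgments; a value near 0 would mean agreement no better than chance.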

Figures

Figures reproduced from arXiv: 2503.14852 by Aldeida Aleti, Lam Nguyen Tung, Neelofar Neelofar, Xiaoning Du.

Figure 1: A prediction by LineVul [13] with its attention-based interpretation at the line level, where darker shading indicates tokens with higher contributions.
Figure 2: The distribution of predictions’ confidence across …
Figure 4: The process of UntrustVul, which consists of three steps: code parsing, line-level assessment, and dependency-level assessment.
Figure 5: Dataset construction in line-level assessment.
Figure 7: Comparing the average effectiveness of UntrustVul and the baselines across different values of IoU threshold
Abstract

Machine learning (ML) has shown promise in vulnerability detection, but ML detectors may rely on irrelevant code features, causing them to highlight non-vulnerable lines as suspicious. Such misleading predictions increase developers' manual effort and may lead to incorrect patching strategies, motivating the need to identify untrustworthy predictions automatically. We present UntrustVul, an approach for detecting untrustworthy vulnerability predictions by identifying suspicious lines that are inherently unrelated to vulnerabilities. UntrustVul leverages patterns from historical vulnerable lines and flags predictions as untrustworthy when the highlighted lines neither match known vulnerability patterns nor influence lines that do. A line is considered vulnerability-irrelevant if it does not resemble historical vulnerabilities and all its successors in the data and control dependency graph are also vulnerability-irrelevant. The approach is designed conservatively to minimise misclassifying trustworthy predictions as untrustworthy. We evaluate UntrustVul on 115K predictions from four models across the BigVul, MegaVul, SARD, and PrimeVul datasets. Results show that UntrustVul achieves AUC scores of 70%-88% and F1-scores of 82%-94%, outperforming existing approaches by 6%-59% in AUC and 13%-92% in F1-score.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces UntrustVul, an automated method to flag untrustworthy vulnerability predictions from ML detectors. It identifies suspicious lines as untrustworthy if they neither resemble historical vulnerable lines nor influence (via PDG successors) lines that do, using a conservative recursive definition based on data/control dependency graphs. Evaluation on 115K predictions from four models across BigVul, MegaVul, SARD, and PrimeVul reports AUC of 70-88% and F1 of 82-94%, outperforming baselines by 6-59% AUC and 13-92% F1.

Significance. If the operationalization of resemblance and historical/test separation can be made reproducible without leakage, the approach could meaningfully reduce developer effort on false-positive alerts in vulnerability detection. The conservative design to avoid misclassifying trustworthy predictions is a methodological strength, and the scale of the evaluation (four datasets, four models) is appropriate for the claim.

major comments (3)
  1. [Abstract] Abstract and method description: the central performance claims (AUC 70-88%, outperformance margins) rest on an unspecified definition of 'resemble historical vulnerabilities' and on the exact procedure for extracting historical patterns; without an equation, feature set, or similarity metric, it is impossible to verify that the 115K test predictions are strictly separated from the historical set.
  2. [Evaluation] Evaluation section: no details are given on baseline implementations, exact PDG construction algorithm, or how historical data is partitioned from the evaluated predictions on BigVul/MegaVul/SARD/PrimeVul; this directly undermines the reported 6-59% AUC gains.
  3. [Approach] The recursive definition of vulnerability-irrelevant lines (a line is irrelevant if it does not resemble history and all successors are irrelevant) is load-bearing for the conservative claim, yet no pseudocode, termination condition, or handling of cycles in the PDG is provided.
minor comments (2)
  1. [Approach] Notation for PDG successors and 'resemble' should be formalized with a small example in the method section.
  2. [Results] Table or figure reporting per-dataset and per-model breakdowns would clarify whether the aggregate 70-88% AUC holds uniformly.
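
The per-dataset and per-model breakdown requested above could be produced along the following lines. This is a hedged sketch, not the paper's evaluation code: the record layout `(dataset, model, label, score)`, the function names, and the convention that label 1 marks an untrustworthy prediction with higher scores meaning "more likely untrustworthy" are all assumptions. AUC is computed directly from its rank definition, the probability that a random positive outscores a random negative, with ties counted as one half.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """Rank-based AUC; returns None when one class is missing."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    # Count positive/negative pairs ranked correctly; ties score 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(
    records: Iterable[Tuple[str, str, int, float]]
) -> Dict[Tuple[str, str], Optional[float]]:
    """Group (dataset, model, label, score) records and compute AUC per group."""
    grouped: Dict[Tuple[str, str], Tuple[List[int], List[float]]] = defaultdict(
        lambda: ([], [])
    )
    for dataset, model, label, score in records:
        labels, scores = grouped[(dataset, model)]
        labels.append(label)
        scores.append(score)
    return {key: auc(labels, scores) for key, (labels, scores) in grouped.items()}
```

The pairwise loop is quadratic in group size, which is acceptable for illustration; a sorted-rank formulation would be preferable at the scale of 115K predictions.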

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve reproducibility and clarity.

  1. Referee: [Abstract] Abstract and method description: the central performance claims (AUC 70-88%, outperformance margins) rest on an unspecified definition of 'resemble historical vulnerabilities' and on the exact procedure for extracting historical patterns; without an equation, feature set, or similarity metric, it is impossible to verify that the 115K test predictions are strictly separated from the historical set.

    Authors: We agree that the description of resemblance and historical pattern extraction requires more precision to allow verification of no leakage. In the revision we will add the exact feature set, similarity metric, and partitioning procedure used to separate historical data from the 115K test predictions. revision: yes

  2. Referee: [Evaluation] Evaluation section: no details are given on baseline implementations, exact PDG construction algorithm, or how historical data is partitioned from the evaluated predictions on BigVul/MegaVul/SARD/PrimeVul; this directly undermines the reported 6-59% AUC gains.

    Authors: We acknowledge the missing implementation details. The revised manuscript will specify baseline implementations, the PDG construction algorithm and tool, and the precise historical/test partitioning method applied to each of the four datasets. revision: yes

  3. Referee: [Approach] The recursive definition of vulnerability-irrelevant lines (a line is irrelevant if it does not resemble history and all successors are irrelevant) is load-bearing for the conservative claim, yet no pseudocode, termination condition, or handling of cycles in the PDG is provided.

    Authors: We will add pseudocode for the recursive procedure, explicitly state the termination condition, and describe cycle handling (via a visited-node set) to prevent infinite recursion. revision: yes
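
The separation between historical data and evaluated predictions discussed in the first two responses could be spot-checked with a simple overlap test. The sketch below is one possible check under assumptions made here, not the authors' partitioning procedure: it treats each sample as a function's source text, fingerprints it after collapsing whitespace, and reports fingerprints that appear on both sides. The function names are hypothetical, and exact-match hashing will miss near-duplicates such as renamed identifiers.

```python
import hashlib
import re
from typing import Iterable, Set


def fingerprint(source: str) -> str:
    """SHA-256 of the source text with all whitespace runs collapsed to one space."""
    normalized = re.sub(r"\s+", " ", source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def overlapping_fingerprints(
    historical: Iterable[str], evaluated: Iterable[str]
) -> Set[str]:
    """Fingerprints of evaluated samples that also occur in the historical set."""
    seen = {fingerprint(src) for src in historical}
    return {fp for fp in map(fingerprint, evaluated) if fp in seen}
```

An empty result is necessary but not sufficient evidence of a clean split; a non-empty result points at concrete samples to inspect.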

Circularity Check

0 steps flagged

No circularity; empirical evaluation on external benchmarks

The paper presents UntrustVul as a rule-based heuristic that flags lines as vulnerability-irrelevant when they neither resemble historical vulnerable lines nor propagate influence via PDG successors. No equations, fitted parameters, or self-referential definitions appear in the provided text; performance (AUC/F1 on 115K predictions across BigVul/MegaVul/SARD/PrimeVul) is reported as direct empirical measurement against external datasets. No self-citation chains, ansatz smuggling, or renaming of known results are present. The derivation is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the domain assumption that historical vulnerable lines form a sufficient basis for pattern matching and that dependency graphs capture influence on vulnerabilities. No free parameters or invented entities are mentioned.

axioms (2)
  • domain assumption Historical vulnerable lines provide representative patterns for identifying irrelevant lines in new code.
    Invoked in the definition of vulnerability-irrelevant lines.
  • domain assumption Data and control dependency graphs accurately reflect influence relationships relevant to vulnerabilities.
    Used to propagate irrelevance through successors.

pith-pipeline@v0.9.0 · 5770 in / 1299 out tokens · 31684 ms · 2026-05-23T00:05:56.470291+00:00 · methodology

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 3 internal anchors

  1. [1]

    Software Assurance Reference Dataset

    2017. Software Assurance Reference Dataset. https://samate.nist.gov/SARD

  2. [2]

    2023. Joern. https://joern.io

  3. [3]

    Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD ’15). Association for Computing Machine...

  4. [4]

    Deep learning based vulnerability detection: Are we there yet?

    S. Chakraborty, R. Krishna, Y. Ding, and B. Ray. 2022. Deep Learning Based Vulnerability Detection: Are We There Yet? IEEE Transactions on Software Engineering 48, 09 (sep 2022), 3280–3296. doi:10.1109/TSE.2021.3087402

  5. [5]

    Baijun Cheng, Shengming Zhao, Kailong Wang, Meizhen Wang, Guangdong Bai, Ruitao Feng, Yao Guo, Lei Ma, and Haoyu Wang. 2024. Beyond Fidelity: Explaining Vulnerability Localization of Learning-Based Detectors. ACM Trans. Softw. Eng. Methodol. 33, 5, Article 127 (June 2024), 33 pages. doi:10.1145/3641543

  6. [6]

    Xiao Cheng, Haoyu Wang, Jiayi Hua, Guoai Xu, and Yulei Sui. 2021. DeepWukong: Statically Detecting Software Vulnerabilities Using Deep Graph Neural Network. ACM Trans. Softw. Eng. Methodol. 30, 3, Article 38 (April 2021), 33 pages. doi:10.1145/3436877

  7. [7]

    Zhaoyang Chu, Yao Wan, Qian Li, Yang Wu, Hongyu Zhang, Yulei Sui, Guandong Xu, and Hai Jin. 2024. Graph Neural Networks for Vulnerability Detection: A Counterfactual Explanation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (Vienna, Austria) (ISSTA 2024). Association for Computing Machinery, New York,...

  8. [8]

    Roland Croft, M. Ali Babar, and M. Mehdi Kholoosi. 2023. Data Quality for Software Vulnerability Datasets. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 121–133. doi:10.1109/ICSE48619.2023.00022

  9. [9]

    Mengnan Du, Ruixiang Tang, Weijie Fu, and Xia Hu. 2022. Towards Debiasing DNN Models from Spurious Feature Influence. Proceedings of the AAAI Conference on Artificial Intelligence 36, 9 (Jun. 2022), 9521–9528. doi:10.1609/aaai.v36i9.21185

  10. [10]

    Jiahao Fan, Yi Li, Shaohua Wang, and Tien N. Nguyen. 2020. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. In Proceedings of the 17th International Conference on Mining Software Repositories (Seoul, Republic of Korea) (MSR ’20). Association for Computing Machinery, New York, NY, USA, 508–512. doi:10.1145/3379597.3387501

  11. [11]

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 1536–1547. doi:10.18...

  12. [12]

    Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3 (July 1987), 319–349. doi:10.1145/24039.24041

  13. [13]

    Michael Fu and Chakkrit Tantithamthavorn. 2022. LineVul: a transformer-based line-level vulnerability prediction. In Proceedings of the 19th International Conference on Mining Software Repositories (Pittsburgh, Pennsylvania) (MSR ’22). Association for Computing Machinery, New York, NY, USA, 608–620. doi:10.1145/3524842.3528452

  14. [14]

    Tom Ganz, Martin Härterich, Alexander Warnecke, and Konrad Rieck. 2021. Explaining Graph Neural Networks for Vulnerability Discovery. In Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security (Virtual Event, Republic of Korea) (AISec ’21). Association for Computing Machinery, New York, NY, USA, 145–156. doi:10.1145/3474369.3486866

  15. [15]

    Shuzheng Gao, Cuiyun Gao, Chaozheng Wang, Jun Sun, David Lo, and Yue Yu. 2023. Two Sides of the Same Coin: Exploiting the Impact of Identifiers in Neural Code Comprehension. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1933–1945. doi:10.1109/ICSE48619.2023.00164

  16. [16]

    Bhavya Ghai, Q. Vera Liao, Yunfeng Zhang, Rachel Bellamy, and Klaus Mueller

  17. [17]

    Explainable Active Learning (XAL): Toward AI Explanations as Interfaces for Machine Teachers. Proc. ACM Hum.-Comput. Interact. 4, CSCW3, Article 235 (jan 2021), 28 pages. doi:10.1145/3432934

  18. [18]

    Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 7212–7225. doi:10.18653/v1/2022.a...

  19. [19]

    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020)

  20. [20]

    LineVD: Statement-level vulnerability detection using graph neural networks

    David Hin, Andrey Kan, Huaming Chen, and M. Ali Babar. 2022. LineVD: statement-level vulnerability detection using graph neural networks. In Proceedings of the 19th International Conference on Mining Software Repositories (Pittsburgh, Pennsylvania) (MSR ’22). Association for Computing Machinery, New York, NY, USA, 596–607. doi:10.1145/3524842.3527949

  21. [21]

    Yutao Hu, Suyuan Wang, Wenke Li, Junru Peng, Yueming Wu, Deqing Zou, and Hai Jin. 2023. Interpreters for GNN-Based Vulnerability Detection: Are We There Yet?. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (Seattle, WA, USA) (ISSTA 2023). Association for Computing Machinery, New York, NY, USA, 1407–1419. doi...

  22. [22]

    Qiang Huang, Makoto Yamada, Yuan Tian, Dinesh Singh, and Yi Chang. 2023. GraphLIME: Local Interpretable Model Explanations for Graph Neural Networks. IEEE Transactions on Knowledge and Data Engineering 35, 7 (2023), 6968–6972. doi:10.1109/TKDE.2022.3187455

  23. [23]

    Shih-Cheng Huang, Akshay S. Chaudhari, Curtis P. Langlotz, Nigam Shah, Serena Yeung, and Matthew P. Lungren. 2022. Developing medical imaging AI for emerging infectious diseases. Nature Communications 13, 1 (18 Nov 2022), 7060. doi:10.1038/s41467-022-34234-4

  24. [24]

    Erik Imgrund, Tom Ganz, Martin Härterich, Lukas Pirch, Niklas Risse, and Konrad Rieck. 2023. Broken Promises: Measuring Confounding Effects in Learning-based Vulnerability Discovery. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (Copenhagen, Denmark) (AISec ’23). Association for Computing Machinery, New York, NY, USA, 149–...

  25. [25]

    L. Arnold Johnson, Kelley L. Dempsey, Ronald S. Ross, Sarbari Gupta, and Dennis Bailey. 2011. Guide for Security-Focused Configuration Management of Information Systems. Technical Report. Gaithersburg, MD, USA

  26. [26]

    Davinder Kaur, Suleyman Uslu, Kaley J. Rittichier, and Arjan Durresi. 2022. Trustworthy Artificial Intelligence: A Review. ACM Comput. Surv. 55, 2, Article 39 (jan 2022), 38 pages. doi:10.1145/3491209

  27. [27]

    Lena Kästner, Markus Langer, Veronika Lazar, Astrid Schomäcker, Timo Speith, and Sarah Sterz. 2021. On the Relation of Trust and Explainability: Why to Engineer for Trustworthiness. In IEEE 29th International Requirements Engineering Conference Workshops. 169–175. doi:10.1109/REW53955.2021.00031

  28. [28]

    Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2019. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications 10, 1 (11 Mar 2019), 1096. doi:10.1038/s41467-019-08987-4

  29. [29]

    Florin Leon, Sabina-Adriana Floria, and Costin Bădică. 2017. Evaluating the effect of voting methods on ensemble-based classification. In 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA). 1–6. doi:10.1109/INISTA.2017.8001122

  30. [30]

    Bo Li, Peng Qi, Bo Liu, Shuai Di, Jingen Liu, Jiquan Pei, Jinfeng Yi, and Bowen Zhou. 2023. Trustworthy AI: From Principles to Practices. ACM Comput. Surv. 55, 9, Article 177 (jan 2023), 46 pages. doi:10.1145/3555803

  31. [31]

    Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding Neural Networks through Representation Erasure. CoRR abs/1612.08220 (2016). http://dblp.uni-trier.de/db/journals/corr/corr1612.html#LiMJ16a

  32. [32]

    Yi Li, Shaohua Wang, and Tien N. Nguyen. 2021. Vulnerability detection with fine-grained interpretations. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Athens, Greece) (ESEC/FSE 2021). Association for Computing Machinery, New York, NY, USA, 292–303. doi:...

  33. [33]

    Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, and Zhaoxuan Chen. 2022. SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities. IEEE Transactions on Dependable and Secure Computing 19, 4 (2022), 2244–2258. doi:10.1109/TDSC.2021.3051525

  34. [34]

    Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. 2018. VulDeePecker: A Deep Learning-Based System for Vulnerability Detection. In Proceedings 2018 Network and Distributed System Security Symposium (NDSS 2018). Internet Society. doi:10.14722/ndss.2018.23158

  35. [35]

    Qinghua Lu, Liming Zhu, Xiwei Xu, Jon Whittle, and Zhenchang Xing. 2022. Towards a roadmap on software engineering for responsible AI. In Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI (Pittsburgh, Pennsylvania) (CAIN ’22). Association for Computing Machinery, New York, NY, USA, 101–112. doi:10.1145/3522664.3528607

  36. [36]

    Dongsheng Luo, Wei Cheng, Dongkuan Xu, Wenchao Yu, Bo Zong, Haifeng Chen, and Xiang Zhang. 2020. Parameterized Explainer for Graph Neural Network. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 19620–19631. https://proceedings.neurips.cc/paper_files/paper/2020/file/e37b08dd3015330dcbb5d6663667b8b8-Paper.pdf

  37. [37]

    Silverio Martínez-Fernández, Justus Bogner, Xavier Franch, Marc Oriol, Julien Siebert, Adam Trendowicz, Anna Maria Vollmer, and Stefan Wagner. 2022. Software Engineering for AI-Based Systems: A Survey. ACM Trans. Softw. Eng. Methodol. 31, 2, Article 37e (April 2022), 59 pages. doi:10.1145/3487043

  38. [38]

    Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 427–436. doi:10.1109/CVPR.2015.7298640

  39. [39]

    Tien N. Nguyen and Raymond Choo. 2022. Human-in-the-loop XAI-enabled vulnerability detection, investigation, and mitigation. In Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (Melbourne, Australia) (ASE ’21). IEEE Press, 1210–1212. doi:10.1109/ASE51524.2021.9678840

  40. [40]

    Chao Ni, Liyu Shen, Xiaohu Yang, Yan Zhu, and Shaohua Wang. 2024. MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representations. In 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR). 738–742

  41. [41]

    Chao Ni, Xin Yin, Kaiwen Yang, Dehai Zhao, Zhenchang Xing, and Xin Xia. 2023. Distinguishing Look-Alike Innocent and Vulnerable Code by Subtle Semantic Representation Learning and Explanation. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (San Francisco, CA, USA) (ESE...

  42. [42]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. doi:10.3115/1073083.1073135

  43. [43]

    Phillip E. Pope, Soheil Kolouri, Mohammad Rostami, Charles E. Martin, and Heiko Hoffmann. 2019. Explainability Methods for Graph Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  44. [44]

    Md Mahbubur Rahman, Ira Ceka, Chengzhi Mao, Saikat Chakraborty, Baishakhi Ray, and Wei Le. 2024. Towards Causal Deep Learning for Vulnerability Detection. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 153, 11 pages. doi:1...

  45. [45]

    Why Should I Trust You?

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, USA, 1135–1144. doi:10.1145/2939672.2939778

  46. [46]

    Vincenzo Riccio, Gunel Jahangirova, Andrea Stocco, Nargiz Humbatova, Michael Weiss, and Paolo Tonella. 2020. Testing machine learning based systems: a systematic mapping. Empirical Software Engineering 25, 6 (01 Nov 2020), 5193–

  47. [47]

    doi:10.1007/s10664-020-09881-0

  48. [48]

    Niklas Risse and Marcel Böhme. 2024. Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection. arXiv:2408.12986 [cs.CR] https://arxiv.org/abs/2408.12986

  49. [49]

    Abhik Roychoudhury, Corina Pasareanu, Michael Pradel, and Baishakhi Ray

  50. [50]

    Agentic AI Software Engineer: Programming with Trust

    AI Software Engineer: Programming with Trust. arXiv:2502.13767 [cs.SE] https://arxiv.org/abs/2502.13767

  51. [51]

    Patrick Schramowski, Wolfgang Stammer, Stefano Teso, Anna Brugger, Franziska Herbert, Xiaoting Shao, Hans-Georg Luigs, Anne-Katrin Mahlein, and Kristian Kersting. 2020. Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nature Machine Intelligence 2, 8 (01 Aug 2020), 476–486. doi:10.1038/s42256-020-0212-3

  52. [52]

    Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV)

  53. [53]

    Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv:1312.6034 [cs.CV] https://arxiv.org/abs/1312.6034

  54. [54]

    Benjamin Steenhoek, Md Mahbubur Rahman, Richard Jiles, and Wei Le. 2023. An Empirical Study of Deep Learning Models for Vulnerability Detection. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 2237–2248. doi:10.1109/ICSE48619.2023.00188

  55. [55]

    Benjamin Steenhoek, Kalpathy Sivaraman, Renata Saldivar Gonzalez, Yevhen Mohylevskyy, Roshanak Zilouchian Moghaddam, and Wei Le. 2025. Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE. arXiv:2412.14306 [cs.SE] https://arxiv.org/abs/2412.14306

  56. [56]

    Russell Stewart, Mykhaylo Andriluka, and Andrew Y. Ng. 2016. End-To-End People Detection in Crowded Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  57. [57]

    Szymon Stradowski and Lech Madeyski. 2024. Interpretability/Explainability Applied to Machine Learning Software Defect Prediction: An Industrial Perspective. IEEE Software (2024), 1–8. doi:10.1109/MS.2024.3505544

  58. [58]

    Scott Thiebes, Sebastian Lins, and Ali Sunyaev. 2021. Trustworthy artificial intelligence. Electronic Markets 31, 2 (01 Jun 2021), 447–464. doi:10.1007/s12525-020-00441-4

  59. [59]

    Lam Nguyen Tung, Steven Cho, Xiaoning Du, Neelofar Neelofar, Valerio Terragni, Stefano Ruberto, and Aldeida Aleti. 2024. Automated Trustworthiness Oracle Generation for Machine Learning Text Classifiers. arXiv:2410.22663 [cs.SE] https://arxiv.org/abs/2410.22663

  60. [60]

    Minh Vu and My T. Thai. 2020. PGM-Explainer: Probabilistic Graphical Model Explanations for Graph Neural Networks. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 12225–12235. https://proceedings.neurips.cc/paper_files/paper/2020/file/8fb134f258b1f7865a6ab2d935a897c9-Paper.pdf

  61. [61]

    Tan Wang, Chang Zhou, Qianru Sun, and Hanwang Zhang. 2021. Causal Attention for Unbiased Visual Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 3091–3100

  62. [62]

    John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4344–4355. doi:10.18653/v1/P19-1427

  63. [63]

    Wenqian Ye, Guangtao Zheng, Xu Cao, Yunsheng Ma, and Aidong Zhang. 2024. Spurious Correlations in Machine Learning: A Survey. arXiv:2402.12715 [cs.LG] https://arxiv.org/abs/2402.12715

  64. [64]

    Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec

  65. [65]

    GNNExplainer: Generating Explanations for Graph Neural Networks. In Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/d80b7040b773199015de6d3b4293c8ff-Paper.pdf

  66. [66]

    Jian Zhang, Shangqing Liu, Xu Wang, Tianlin Li, and Yang Liu. 2023. Learning to Locate and Describe Vulnerabilities. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). 332–344. doi:10.1109/ASE56229.2023.00045

  67. [67]

    Yue Zhang, David Defazio, and Arti Ramesh. 2021. RelEx: A Model-Agnostic Relational Model Explainer. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (Virtual Event, USA) (AIES ’21). Association for Computing Machinery, New York, NY, USA, 1042–1049. doi:10.1145/3461702.3462562

  68. [68]

    Yunhui Zheng, Saurabh Pujar, Burn Lewis, Luca Buratti, Edward Epstein, Bo Yang, Jim Laredo, Alessandro Morari, and Zhong Su. 2021. D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis. In Proceedings of the ACM/IEEE 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP ’...

  69. [69]

    Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/49265d2447bc3b...

  70. [70]

    Deqing Zou, Yutao Hu, Wenke Li, Yueming Wu, Haojun Zhao, and Hai Jin

  71. [71]

    mVulPreter: A Multi-Granularity Vulnerability Detection System With Interpretations. IEEE Transactions on Dependable and Secure Computing (2022), 1–12. doi:10.1109/TDSC.2022.3199769