arxiv: 2403.16032 · v3 · submitted 2024-03-24 · 💻 cs.SE

DeepFWI: Identifying Bug-Sensitive Warnings with Multi-Modal Code-Warning Semantics

Pith reviewed 2026-05-24 02:47 UTC · model grok-4.3

classification 💻 cs.SE
keywords static analysis · bug detection · false warnings · machine learning · LSTM · multi-modal semantics · software engineering

The pith

DeepFWI identifies true bug warnings at fine granularity by learning multi-modal semantics from code and static analysis alerts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepFWI to separate warnings that actually signal bugs from the flood of false positives produced by automated static analysis tools. Earlier learning methods operated at coarse levels such as whole functions or long-term trends and either used hand-crafted features or code alone, limiting their sensitivity to individual issues. DeepFWI instead trains an LSTM that ingests both source code and warning text, using cross-attention to surface their joint patterns. A newly assembled dataset of 280,273 warnings supplies the training signal, and the model reaches 67.06 percent F1 on confirming true warnings while also surfacing real bugs when run on four open-source projects.

Core claim

DeepFWI is an LSTM-based model that captures multi-modal semantics of source code and warnings from automated static analysis tools and highlights their correlations with cross-attention. Trained and evaluated on a collected dataset of 280,273 warnings, the model achieves an F1-score of 67.06 percent for confirming true warnings in a finer-grained manner and outperforms all baselines. When applied to four popular open-source projects, it filters the vast majority of warnings while still surfacing 25 true bug-related warnings confirmed by manual analysis.
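
For readers who want the headline metric spelled out, the following is a minimal sketch of how an F1-score for the positive class (here, bug-sensitive warnings) is computed from confusion counts. The counts in the example are hypothetical and are not taken from the paper, which reports only the final 67.06 percent figure on this page.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1-score of the positive class from raw confusion counts."""
    if true_positives == 0:
        # No correct positive prediction: precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (NOT the paper's numbers).
example = f1_score(true_positives=60, false_positives=30, false_negatives=30)
print(round(example, 4))
```

Because F1 ignores true negatives, it is a reasonable choice when most warnings are false alarms and a plain accuracy figure would be dominated by the majority class.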

What carries the argument

LSTM model with cross-attention that fuses multi-modal semantics from source code and warning messages to correlate them with actual bugs.
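
As a rough illustration of what cross-attention between the two modalities means, the sketch below implements plain scaled dot-product attention in pure Python, with warning-token states as queries and code-token states as keys and values. It is an assumption-laden toy: the vectors are hypothetical stand-ins for LSTM hidden states, and the paper's actual layer sizes, projections, and fusion step are not described on this page.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys/values."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


# Hypothetical 2-d states; in a real model these would come from the encoders.
warning_states = [[1.0, 0.0], [0.0, 1.0]]
code_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Each warning token gets a code-aware summary vector.
fused = cross_attention(warning_states, code_states, code_states)
print(fused)
```

Each output row is a convex combination of the code states, weighted towards the code tokens most similar to the corresponding warning token.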

If this is right

  • The fine-grained identification allows developers to focus review effort on a much smaller set of likely-true warnings.
  • Application to real projects demonstrates practical filtering that retains confirmed bugs while discarding most false alarms.
  • The multi-modal cross-attention design directly addresses the limitations of prior coarse-grained or single-modality approaches.
  • Outperformance of baselines holds across the collected dataset of over 280,000 warnings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding the model inside existing static analysis pipelines could reduce developer fatigue and increase tool adoption.
  • Similar datasets and models could be built for additional languages or analyzer families to test generalization.
  • Performance might improve if the training data were expanded with warnings from more diverse project domains.

Load-bearing premise

Manual labeling of the 280k warnings produces accurate ground truth without systematic bias, and the collected warnings are representative of those encountered in unseen projects.

What would settle it

Independent re-labeling of a held-out subset of the 280k warnings by multiple experts, followed by re-running the trained model to check whether the reported F1-score holds or drops substantially.
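
One common way to quantify agreement in such a re-labeling exercise is Cohen's kappa between two annotators. The sketch below is a generic implementation under that assumption; the paper does not state which agreement statistic, if any, it uses, and the label lists are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = bug-sensitive, 0 = bug-insensitive.
expert_1 = [1, 1, 0, 0, 1, 0, 1, 0]
expert_2 = [1, 1, 0, 0, 1, 0, 0, 1]
print(cohens_kappa(expert_1, expert_2))
```

A kappa near 1 on the held-out subset would support the labels; a low value would suggest that the reported F1-score partly measures label noise.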

Figures

Figures reproduced from arXiv: 2403.16032 by Cen Zhang, Han Liu, Jian Zhang, Kaixuan Li, Sen Chen, Shang-Wei Lin, Xiaohan Zhang, Xinhua Li, Yang Liu, Yixiang Chen.

Figure 1: The process of the data collection.
$W_b = \{\, w \in W \mid \exists (C_b, C_f) \in H : w \in SA(C_b) \text{ and } w \notin SA(C_f) \,\}$, in which $(C_b, C_f)$ represents a bug in code $C_b$ that is fixed in the corresponding code $C_f$. With the warnings $W = \{w_1, w_2, \ldots, w_N\}$ and the corresponding code snippets $C = \{c_1, c_2, \ldots, c_N\}$ as the input, the target is to distinguish whether each warning is bug-sensitive or bug-insensitive. Alternatively, a classifier mode…
Figure 2: The framework of our approach.
… disappears in the fixed version, we mark it as a bug-sensitive warning. Conversely, we interpreted that such a warning had no correlation to the specific bug, designating it as a bug-insensitive warning. Given that a single file might contain multiple bugs, which may not necessarily be fixed in a single commit, our collection faced a challenge. Some warnings may have been flagged…
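
The labeling rule quoted in the figure excerpts (a warning is bug-sensitive if the analyzer reports it on the buggy version but not on the fixed one) can be sketched as a set difference over bug-fix pairs. Everything named here is hypothetical: `analyze` stands in for a real static analyzer, and the `(rule id, method)` key is an assumed way of matching warnings across versions, which the excerpt does not specify.

```python
from typing import Callable, Iterable, Set, Tuple

# Hypothetical stable key for a warning: (rule id, enclosing method).
WarningKey = Tuple[str, str]


def bug_sensitive_warnings(
    history: Iterable[Tuple[str, str]],
    analyze: Callable[[str], Set[WarningKey]],
) -> Set[WarningKey]:
    """Collect warnings reported on a buggy version but absent after the fix."""
    sensitive: Set[WarningKey] = set()
    for buggy_code, fixed_code in history:
        # w in SA(C_b) and w not in SA(C_f)
        sensitive |= analyze(buggy_code) - analyze(fixed_code)
    return sensitive


# Toy stand-in for a static analyzer: a lookup table of canned reports.
fake_reports = {
    "A_buggy": {("NP_NULL", "parse"), ("DM_STRING", "log")},
    "A_fixed": {("DM_STRING", "log")},
}

labels = bug_sensitive_warnings([("A_buggy", "A_fixed")], lambda code: fake_reports[code])
print(labels)
```

In this toy run the `DM_STRING` warning survives the fix and is therefore treated as bug-insensitive, while `NP_NULL` disappears and is labeled bug-sensitive.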

Original abstract

Static analysis tools have evolved over time to assist in detecting bugs. However, excessive false warnings can impede developers' productivity and confidence in the tools. Previous research efforts have explored learning-based approaches to identify bug warnings. Nevertheless, their coarse granularity, focusing on either long-term warnings or function-level alerts, is insensitive to individual bugs. Also, they rely on manually crafted features or solely on source code semantics, which is inadequate for effective learning. In this paper, we propose DeepFWI, a learning-based approach that identifies bug-sensitive warnings at a fine granularity. Specifically, we design a novel LSTM-based model that captures multi-modal semantics of source code and warnings from automated static analysis tools (ASATs) and highlights their correlations with cross-attention. To tackle the scarcity of data for training and evaluation, we collected a large-scale dataset of 280,273 warnings. We conducted extensive experiments on the dataset to evaluate DeepFWI. The experimental results demonstrate the effectiveness of our approach, with an F1-score of 67.06% for confirming true warnings in a finer-grained manner, significantly outperforming all baselines. Additionally, to validate the practicality of DeepFWI from the perspective of developers, we applied DeepFWI to four popular open-source projects. Our approach filtered out the vast majority of warnings, while still successfully surfacing 25 true bug-related warnings that were confirmed through manual analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents DeepFWI, an LSTM-based model augmented with cross-attention to jointly encode multi-modal semantics from source code and static-analysis warnings, with the goal of identifying bug-sensitive warnings at fine granularity. The authors report collecting a dataset of 280,273 warnings, achieving an F1-score of 67.06% that outperforms baselines, and, in a real-world deployment on four open-source projects, surfacing 25 manually confirmed bug-related warnings after filtering the majority of alerts.

Significance. If the ground-truth labels prove reliable and the experimental protocol is reproducible, the work could have practical significance for improving the signal-to-noise ratio of static-analysis tools. The scale of the collected dataset and the end-to-end deployment that yielded confirmed bugs are concrete strengths that would support adoption if the labeling and evaluation details are strengthened.

major comments (2)
  1. [Dataset construction and evaluation (abstract and §4–5)] The central empirical claim (F1 = 67.06% and superiority over baselines) rests entirely on supervised learning from a manually labeled corpus of 280,273 warnings. The manuscript supplies no annotation protocol, number of annotators, inter-annotator agreement statistics, or label-validation procedure. This omission is load-bearing for every reported performance number and for the claim of “finer-grained” superiority.
  2. [Experimental setup (§5)] The experimental protocol is described at too high a level to assess validity: data-split strategy, exact baseline re-implementations, hyper-parameter search, and any post-hoc filtering of the test set are not reported. Without these details the 67.06 % F1 cannot be interpreted as evidence of a methodological advance.
minor comments (2)
  1. [Abstract] The abstract describes DeepFWI as “significantly outperforming all baselines” yet neither names the baselines nor supplies their F1 values.
  2. [Model description (§3)] Notation for the cross-attention module and the precise definitions of “bug-sensitive” and “bug-insensitive” warnings should be introduced earlier and used consistently.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments correctly identify omissions in the original manuscript regarding dataset labeling and experimental reproducibility. We will revise the paper to address both points in full.

  1. Referee: [Dataset construction and evaluation (abstract and §4–5)] The central empirical claim (F1 = 67.06% and superiority over baselines) rests entirely on supervised learning from a manually labeled corpus of 280,273 warnings. The manuscript supplies no annotation protocol, number of annotators, inter-annotator agreement statistics, or label-validation procedure. This omission is load-bearing for every reported performance number and for the claim of “finer-grained” superiority.

    Authors: We agree that the annotation protocol, annotator count, inter-annotator agreement, and validation procedure were not reported. This information is essential for assessing label quality. In the revised manuscript we will add a dedicated subsection in §4 that describes the full labeling process, the number of annotators, the annotation guidelines, inter-annotator agreement statistics, and the label-validation steps performed. revision: yes

  2. Referee: [Experimental setup (§5)] The experimental protocol is described at too high a level to assess validity: data-split strategy, exact baseline re-implementations, hyper-parameter search, and any post-hoc filtering of the test set are not reported. Without these details the 67.06 % F1 cannot be interpreted as evidence of a methodological advance.

    Authors: We concur that the experimental protocol lacks the necessary detail for reproducibility. The revised §5 will explicitly state the train/validation/test split strategy (including how leakage was prevented), the precise re-implementations and hyper-parameter settings of each baseline, the hyper-parameter search procedure and ranges used for DeepFWI, and any post-hoc filtering applied to the test set. revision: yes
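
One way a split could prevent leakage, as the response above promises to document, is to keep all warnings from a given project on the same side of the split. The sketch below shows that idea only; it is an assumption about a reasonable protocol, not the authors' procedure, and the project names and warning ids are invented.

```python
import random
from collections import defaultdict


def split_by_project(samples, test_fraction=0.2, seed=0):
    """Split (project, item) pairs so that no project appears in both parts."""
    by_project = defaultdict(list)
    for project, item in samples:
        by_project[project].append(item)
    # Sort first so the shuffle is reproducible for a given seed.
    projects = sorted(by_project)
    random.Random(seed).shuffle(projects)
    n_test = max(1, round(len(projects) * test_fraction))
    test_projects = set(projects[:n_test])
    train = [(p, i) for p, i in samples if p not in test_projects]
    test = [(p, i) for p, i in samples if p in test_projects]
    return train, test


# Hypothetical data: five projects with three warnings each.
samples = [(f"project_{k}", f"warning_{k}_{j}") for k in range(5) for j in range(3)]
train, test = split_by_project(samples)
print(len(train), len(test))
```

A random per-warning split would instead let near-duplicate code from one project land in both training and test data, which tends to inflate reported scores.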

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs

The paper describes a standard supervised learning pipeline: manual collection and labeling of 280,273 warnings as ground truth, followed by training an LSTM model with cross-attention on multi-modal features and reporting F1 on the dataset. No equations, self-citations, or procedures are present that reduce the reported F1-score to a fitted parameter or prior result by construction. The evaluation metric measures agreement with externally supplied labels rather than recovering any input quantity, satisfying the criteria for a self-contained empirical claim.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard supervised-learning assumptions plus the availability of a large manually labeled warning dataset; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • LSTM and attention hyperparameters
    Hidden sizes, learning rate, and other architecture choices are tuned on the training portion of the 280k-warning dataset.
axioms (1)
  • domain assumption: Manual analysis can produce reliable true/false labels for individual warnings.
    Required for the supervised training and evaluation setup described in the abstract.
