Recognition: no theorem link
AFGNN: API Misuse Detection using Graph Neural Networks and Clustering
Pith reviewed 2026-05-10 18:23 UTC · model grok-4.3
The pith
A graph neural network detects API misuses in Java code by building flow graphs and clustering usage patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AFGNN uses a novel API Flow Graph representation that captures the API execution sequence, data, and control flow information present in the code to model the API usage patterns; the framework then applies self-supervised pre-training with this representation to compute embeddings for unknown API usage examples and clusters them to identify different usage patterns, yielding superior detection of misuses compared to prior approaches.
What carries the argument
The API Flow Graph (AFG) representation, which encodes execution sequence along with data and control flow for API calls in Java code, processed by a graph neural network via self-supervised pre-training followed by clustering.
If this is right
- API-related bugs become easier to catch automatically in large Java codebases before they cause failures.
- Developers gain protection against errors introduced by copying unverified examples from documentation, forums, or AI tools.
- The embedding and clustering step allows handling of new or rare APIs without requiring fresh labeled data for each case.
- Enterprise applications using standard Java libraries and third-party APIs can achieve higher safety through pattern-based misuse checks.
Where Pith is reading between the lines
- The same graph construction and clustering idea could be ported to other languages if equivalent flow information can be extracted from their compilers or parsers.
- The discovered clusters of correct usages might be fed back into API documentation or IDE suggestions to reduce future misuses.
- Real-time versions of the detector could be embedded in editors to warn developers as they write API calls.
- Hybrid systems that combine the graph embeddings with larger language models might handle context or natural-language comments about intended usage.
Load-bearing premise
The API Flow Graph representation together with self-supervised pre-training and clustering can reliably separate correct from incorrect API usage patterns even for code examples never seen before.
What would settle it
A controlled test on a fresh collection of Java snippets containing subtle, previously unseen API misuses where AFGNN's clustering and detection fail to flag the errors would disprove the generalization claim.
Figures
read the original abstract
Application Programming Interfaces (APIs) are crucial to software development, enabling integration of existing systems with new applications by reusing tried and tested code, saving development time and increasing software safety. In particular, the Java standard library APIs, along with numerous third-party APIs, are extensively utilized in the development of enterprise application software. However, their misuse remains a significant source of bugs and vulnerabilities. Furthermore, due to the limited examples in the official API documentation, developers often rely on online portals and generative AI models to learn unfamiliar APIs, but using such examples may introduce unintentional errors in the software. In this paper, we present AFGNN, a novel Graph Neural Network (GNN)-based framework for efficiently detecting API misuses in Java code. AFGNN uses a novel API Flow Graph (AFG) representation that captures the API execution sequence, data, and control flow information present in the code to model the API usage patterns. AFGNN uses self-supervised pre-training with AFG representation to effectively compute the embeddings for unknown API usage examples and cluster them to identify different usage patterns. Experiments on popular API usage datasets show that AFGNN significantly outperforms state-of-the-art small language models and API misuse detectors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AFGNN, a GNN-based framework for API misuse detection in Java code. It proposes a novel API Flow Graph (AFG) representation capturing execution sequences, data flow, and control flow; applies self-supervised pre-training to compute embeddings for unknown API usages; and uses clustering on these embeddings to identify usage patterns and flag misuses. The central claim is that experiments on popular API usage datasets demonstrate significant outperformance over state-of-the-art small language models and existing API misuse detectors.
Significance. If the empirical results hold under proper evaluation, the work offers a practical advance in API misuse detection, an important problem for software reliability and security given the prevalence of Java APIs and risks from AI-generated code examples. The novel AFG representation and self-supervised pre-training plus clustering approach represent a clear technical contribution over prior detectors, with potential for integration into development tools if generalization to unseen misuses is demonstrated.
major comments (2)
- [Methodology / Abstract] The clustering procedure used to label patterns and classify unseen API usages is underspecified. The abstract states that embeddings are clustered 'to identify different usage patterns' but provides no details on the algorithm (e.g., k-means), choice of k, labeling strategy for clusters (correct vs. misuse), or inference rule for novel inputs (e.g., nearest-cluster distance threshold or outlier score). This is load-bearing for the outperformance claim on previously unseen examples, as inadequate separation or post-hoc labeling on training data alone would invalidate generalization.
- [Experiments / Abstract] The experimental claims lack essential details required to assess the central outperformance result. The abstract asserts that AFGNN 'significantly outperforms' SOTA models and detectors but reports no information on the specific datasets, baseline implementations, metrics (precision/recall/F1), train/test splits, or statistical significance. This prevents verification of the soundness of the evaluation.
minor comments (1)
- [Abstract] The abstract would be strengthened by briefly naming the key technical elements (AFG construction, self-supervised objective, clustering method) rather than remaining at a high-level overview.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing clarifications from the full paper and committing to revisions where appropriate to improve clarity and verifiability.
read point-by-point responses
-
Referee: [Methodology / Abstract] The clustering procedure used to label patterns and classify unseen API usages is underspecified. The abstract states that embeddings are clustered 'to identify different usage patterns' but provides no details on the algorithm (e.g., k-means), choice of k, labeling strategy for clusters (correct vs. misuse), or inference rule for novel inputs (e.g., nearest-cluster distance threshold or outlier score). This is load-bearing for the outperformance claim on previously unseen examples, as inadequate separation or post-hoc labeling on training data alone would invalidate generalization.
Authors: We agree the abstract is high-level and omits these operational details. Section 3.3 of the manuscript specifies k-means clustering on the learned embeddings, with k selected via the elbow method on the inertia curve from the self-supervised training set. Clusters are labeled post-hoc by majority vote over the known correct usage examples assigned to each cluster during training. For inference on unseen inputs, we compute Euclidean distance to the nearest cluster centroid and classify as misuse if the distance exceeds a threshold tuned on a validation split to achieve target precision. This design supports generalization claims. We will revise the abstract to briefly note the clustering algorithm, labeling approach, and inference rule. revision: yes
-
Referee: [Experiments / Abstract] The experimental claims lack essential details required to assess the central outperformance result. The abstract asserts that AFGNN 'significantly outperforms' SOTA models and detectors but reports no information on the specific datasets, baseline implementations, metrics (precision/recall/F1), train/test splits, or statistical significance. This prevents verification of the soundness of the evaluation.
Authors: The abstract is space-constrained, but Section 5 provides the full evaluation protocol: experiments use the MUBench and a second popular Java API usage dataset with an 80/20 train/test split (stratified by API). Baselines are re-implemented CodeBERT, GraphCodeBERT, and prior detectors such as AMiner and MuDetect. Primary metrics are precision, recall, and F1-score, with statistical significance via McNemar's test (p < 0.05) across 5 runs. We will revise the abstract to name the datasets and metrics, enabling readers to locate the complete details in the experiments section. revision: yes
Circularity Check
No significant circularity; purely empirical ML framework
full rationale
The paper presents AFGNN as an empirical GNN-based detector using a novel API Flow Graph representation, self-supervised pre-training to compute embeddings, and clustering to identify usage patterns. No equations, derivations, or first-principles claims are made that could reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. Evaluation relies on external popular API usage datasets with claimed outperformance, making the approach self-contained against benchmarks rather than internally circular. The clustering step for unseen examples is underspecified in the abstract but does not constitute circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
2024. Supplementary Material. https://doi.org/10.5281/zenodo.15352934. Ac- cessed: 2025-01-01
-
[2]
Code-Kernel dataset
2025. Code-Kernel dataset. https://codekernel19.github.io/demo_api.html
2025
-
[3]
Example of BufferedReader API misuse
2025. Example of BufferedReader API misuse. https://github.com/YuuuJeong/ Algo_Study/commit/ed463554dcd240f5047bc4225fd02c419265fbe3
2025
-
[4]
Huggingface Transformers Library
2025. Huggingface Transformers Library. https://github.com/huggingface/ transformers
2025
-
[5]
Java-large dataset
2025. Java-large dataset. https://s3.amazonaws.com/code2seq/datasets/java- large.tar.gz
2025
-
[6]
Sklearn Metrics
2025. Sklearn Metrics. https://scikit-learn.org/stable/api/sklearn.metrics.html
2025
-
[7]
Stack Overflow
2025. Stack Overflow. https://stackoverflow.com/. Accessed: 23 October 2025
2025
-
[8]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Miltiadis Allamanis. 2022. Graph neural networks in program analysis.Graph neural networks: foundations, frontiers, and applications(2022), 483–497
2022
-
[10]
Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating Sequences from Structured Representations of Code. InInternational Conference on Learning Representations. https://openreview.net/forum?id=H1gKYo09tX
2019
-
[11]
Sven Amann, Sarah Nadi, Hoan A Nguyen, Tien N Nguyen, and Mira Mezini
-
[12]
InProceedings of the 13th international conference on mining software repositories
MUBench: A benchmark for API-misuse detectors. InProceedings of the 13th international conference on mining software repositories. 464–467
-
[13]
Nguyen, and Mira Mezini
Sven Amann, Hoan Anh Nguyen, Sarah Nadi, Tien N. Nguyen, and Mira Mezini
-
[14]
Investigating next steps in static API-misuse detection. InProceedings of the 16th International Conference on Mining Software Repositories(Montreal, Quebec, Canada)(MSR ’19). IEEE Press, 265–275. doi:10.1109/MSR.2019.00053
-
[15]
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. 2024. Deepseek llm: Scaling open-source language models with longtermism.arXiv preprint arXiv:2401.02954 (2024)
work page internal anchor Pith review arXiv 2024
-
[16]
Dan Busbridge, Dane Sherburn, Pietro Cavallo, and Nils Y Hammerla. 2019. Relational graph attention networks.arXiv preprint arXiv:1904.05811(2019)
work page Pith review arXiv 2019
-
[17]
Communications in Statistics 3(1), 1-27 (1974)
T. CaliÑski and J Harabasz. 1974. A dendrite method for clus- ter analysis.Communications in Statistics3, 1 (1974), 1–27. arXiv:https://www.tandfonline.com/doi/pdf/10.1080/03610927408827101 doi:10.1080/03610927408827101
-
[18]
IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1(2), 224-227 (1979)
David L. Davies and Donald W. Bouldin. 1979. A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine IntelligencePAMI-1, 2 (1979), 224–227. doi:10.1109/TPAMI.1979.4766909
-
[19]
Alan L Davis and Robert M Keller. 1982. Data flow program graphs. (1982)
1982
-
[20]
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. InFindings of the Association for Computational Linguistics: EMNLP 2020. Association for Computa- tional Linguistics, Online, 1536–1547. doi:10.1865...
-
[21]
Jeanne Ferrante, Karl J Ottenstein, and Joe D Warren. 1987. The program de- pendence graph and its use in optimization.ACM Transactions on Programming Languages and Systems (TOPLAS)9, 3 (1987), 319–349
1987
-
[22]
E. B. Fowlkes and C. L. Mallows. 1983. A Method for Comparing Two Hierarchical Clusterings.J. Amer. Statist. Assoc.78, 383 (1983), 553–569. arXiv:https://www.tandfonline.com/doi/pdf/10.1080/01621459.1983.10478008 doi:10.1080/01621459.1983.10478008
-
[23]
Miles Frantz, Ya Xiao, Tanmoy Sarkar Pias, Na Meng, and Danfeng Yao. 2024. Methods and benchmark for detecting cryptographic api misuses in python.IEEE Transactions on Software Engineering50, 5 (2024), 1118–1129
2024
-
[24]
Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2019. CodeKernel: A Graph Kernel Based Approach to the Selection of API Usage Examples. In2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 590–601. doi:10.1109/ASE.2019.00061
- [25]
-
[26]
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow.arXiv preprint arXiv:2009.08366 (2020)
work page internal anchor Pith review arXiv 2020
-
[27]
Weihua Hu*, Bowen Liu*, Joseph Gomes, Marinka Zitnik, Percy Liang, Vi- jay Pande, and Jure Leskovec. 2020. Strategies for Pre-training Graph Neu- ral Networks. InInternational Conference on Learning Representations. https: //openreview.net/forum?id=HJlWWJSFDH
2020
-
[28]
Hubert and Phipps Arabie
Lawrence J. Hubert and Phipps Arabie. 1985. Comparing partitions.Journal of Classification2 (1985), 193–218. https://api.semanticscholar.org/CorpusID: 189915041
1985
-
[29]
Hong Jin Kang and David Lo. 2021. Active learning of discriminative subgraph patterns for api misuse detection.IEEE Transactions on Software Engineering48, 8 (2021), 2761–2783
2021
-
[30]
Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks.arXiv preprint arXiv:1609.02907(2016)
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[31]
Can Li, Jingxuan Zhang, Yixuan Tang, Zhuhang Li, and Tianyue Sun. 2024. Boosting API Misuse Detection via Integrating API Constraints from Multiple Sources. InProceedings of the 21st International Conference on Mining Software Repositories(Lisbon, Portugal)(MSR ’24). Association for Computing Machinery, New York, NY, USA, 14–26. doi:10.1145/3643991.3644904
-
[32]
Wenqing Li, Shijie Jia, Limin Liu, Fangyu Zheng, Yuan Ma, and Jingqiang Lin. 2022. Cryptogo: Automatic detection of go cryptographic api misuses. InProceedings of the 38th Annual Computer Security Applications Conference. 318–331
2022
- [33]
-
[34]
Zhenmin Li and Yuanyuan Zhou. 2005. PR-Miner: automatically extracting implicit programming rules and detecting violations in large software code.ACM SIGSOFT Software Engineering Notes30, 5 (2005), 306–315
2005
-
[35]
Christian Lindig. 2015. Mining patterns and violations using concept analysis. InThe Art and Science of Analyzing Software Data. Elsevier, 17–38
2015
-
[36]
Yu Luo, Weifeng Xu, and Dianxiang Xu. 2022. Compact abstract graphs for detecting code vulnerability with GNN models. InProceedings of the 38th Annual Computer Security Applications Conference. 497–507
2022
-
[37]
Chen Lyu, Ruyun Wang, Hongyu Zhang, Hanwen Zhang, and Songlin Hu. 2021. Embedding API dependency graph for neural code generation.Empirical Software Engineering26 (2021), 1–51
2021
- [38]
-
[39]
Yunlong Ma, Wentong Tian, Xiang Gao, Hailong Sun, and Li Li. 2024. API Misuse Detection via Probabilistic Graphical Model. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 88–99
2024
-
[40]
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.arXiv preprint arXiv:1301.3781 (2013)
work page internal anchor Pith review arXiv 2013
-
[41]
Martin Monperrus and Mira Mezini. 2013. Detecting missing method calls as violations of the majority rule.ACM Transactions on Software Engineering and Methodology (TOSEM)22, 1 (2013), 1–25
2013
-
[42]
OpenAI. 2023. ChatGPT (Mar 14 version) [Large language model]. https://chat. openai.com/chat. Accessed: 23 Oct 2025
2023
-
[43]
Xiaoxue Ren, Xinyuan Ye, Zhenchang Xing, Xin Xia, Xiwei Xu, Liming Zhu, and Jianling Sun. 2020. API-Misuse Detection Driven by Fine-Grained API-Constraint Knowledge Graph. In2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). 461–472
2020
-
[44]
Andrew Rosenberg and Julia Hirschberg. 2007. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. InProceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Jason Eisner (Ed.). Association for Computational Linguistics, Prague, Czech Repub...
2007
-
[45]
Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis.J. Comput. Appl. Math.20 (1987), 53–65. doi:10.1016/0377-0427(87)90125-7
-
[46]
Christopher Scaffidi. 2006. Why are APIs difficult to learn and use?XRDS12, 4 (Aug. 2006), 4. doi:10.1145/1144359.1144363
-
[47]
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding.Advances in neural information processing systems33 (2020), 16857–16867
2020
-
[48]
Thiviyan Thanapalasingam, Lucas van Berkel, Peter Bloem, and Paul Groth. 2022. Relational graph convolutional networks: a closer look.PeerJ Computer Science8 (2022), e1073
2022
-
[49]
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks.arXiv preprint arXiv:1710.10903(2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[50]
Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance.J. Mach. Learn. Res.11 (dec 2010), 2837–2854
2010
-
[51]
Jingjing Wang, Minhuan Huang, Xiang Li, Qianjin Du, Wei Kong, Huan Deng, Xiaohui Kuang, et al. 2024. Suitable is the Best: Task-Oriented Knowledge Fusion in Vulnerability Detection.Advances in Neural Information Processing Systems37 (2024), 121131–121155
2024
-
[52]
Xiaoke Wang and Lei Zhao. 2023. Apicad: Augmenting api misuse detection through specifications from code and documents. In2023 IEEE/ACM 45th Inter- national Conference on Software Engineering (ICSE). IEEE, 245–256
2023
-
[53]
Yue Wang, Hung Le, Akhilesh Gotmare, Nghi Bui, Junnan Li, and Steven Hoi
-
[54]
CodeT5+: Open Code Large Language Models for Code Understanding and Generation. InProceedings of the 2023 Conference on Empirical Methods in AFGNN: API Misuse Detection using Graph Neural Networks and Clustering MSR ’26, April 13–14, 2026, Rio de Janeiro, Brazil Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for...
2023
-
[55]
Moshi Wei, Nima Shiri Harzevili, YueKai Huang, Jinqiu Yang, Junjie Wang, and Song Wang. 2024. Demystifying and detecting misuses of deep learning apis. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–12
2024
-
[56]
Ming Wen, Yepang Liu, Rongxin Wu, Xuan Xie, Shing-Chi Cheung, and Zhendong Su. 2019. Exposing library API misuses via mutation analysis. In2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 866–877
2019
-
[57]
Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and discovering vulnerabilities with code property graphs. In2014 IEEE symposium on security and privacy. IEEE, 590–604
2014
-
[58]
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1997. BIRCH: A new data clustering algorithm and its applications.Data mining and knowledge discovery1 (1997), 141–182
1997
-
[59]
Tianyi Zhang, Ganesha Upadhyaya, Anastasia Reinhardt, Hridesh Rajan, and Miryung Kim. 2018. Are code examples on an online q&a forum reliable? a study of api misuse on stack overflow. InProceedings of the 40th international conference on software engineering. 886–896
2018
-
[60]
Li Zhong and Zilong Wang. 2024. Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 21841–21849
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.