Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs
Pith reviewed 2026-05-10 17:10 UTC · model grok-4.3
The pith
Modern LLMs detect interprocedural vulnerabilities in C, C++, and Python when given caller and callee code, with strong accuracy at low cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By systematically varying the amount of interprocedural context supplied to the models and measuring performance on the ReposVul dataset, the study finds that Gemini 3 Flash delivers the best cost-effectiveness trade-off for C vulnerabilities, reaching an F1 of at least 0.978 at an estimated cost of $0.50 to $0.58 per configuration. Claude Haiku 4.5, in turn, correctly identifies and explains the vulnerability in 93.6 percent of evaluated cases across the three languages.
What carries the argument
Empirical comparison of four LLMs across three graded levels of interprocedural context (target function only, plus callers, plus callees) on 509 labeled vulnerabilities, tracking F1 score, inference cost, and explanation correctness.
If this is right
- Adding callee code to prompts measurably improves detection rates for several models on C vulnerabilities.
- Low-cost models such as Gemini 3 Flash can be used repeatedly inside automated security pipelines without large per-file expenses.
- Explanation quality is model-dependent, with Claude Haiku 4.5 producing the highest share of accurate written justifications.
- The same prompting approach works across C, C++, and Python, supporting multi-language security tools.
- Performance remains high even when only one side of the call relationship is supplied, reducing the need for full program analysis.
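A back-of-envelope check makes the "under one dollar per configuration" claim concrete. The token counts and per-million-token prices below are illustrative assumptions, not the paper's measured values.

```python
# Per-configuration cost = cases x (input tokens x input price
#                                   + output tokens x output price).
# All numbers below are assumed for illustration.

def config_cost(n_cases: int, in_tokens: int, out_tokens: int,
                in_price_per_m: float, out_price_per_m: float) -> float:
    """Total USD cost of one (model, context-level) configuration."""
    per_case = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6
    return n_cases * per_case

# e.g. 509 cases, ~2,500 input and ~400 output tokens per case,
# at assumed prices of $0.30 / $1.20 per million tokens:
cost = config_cost(509, 2500, 400, 0.30, 1.20)
print(f"${cost:.2f}")  # → $0.63
```

Under these assumed inputs the total lands in the same sub-dollar ballpark as the reported $0.50 to $0.58 range for Gemini 3 Flash.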
Where Pith is reading between the lines
- Security tools could automatically select which context level to include based on a quick first pass, balancing accuracy against token cost.
- The approach might extend to detecting other defect classes such as concurrency bugs or resource leaks that also cross function boundaries.
- Routine whole-repository scans become feasible if per-configuration costs stay under one dollar, potentially catching issues earlier in development.
- Without independent human review of explanations, some high F1 scores may mask cases where models give plausible but incorrect reasoning.
Load-bearing premise
The 509 selected vulnerabilities and the way caller and callee snippets are extracted and shown to the models represent the practical difficulties of interprocedural detection, and the models' explanations do not require separate human checking for correctness.
What would settle it
Re-running the four LLMs on a fresh collection of interprocedural vulnerabilities from additional open-source projects and obtaining F1 scores below 0.90 or explanation accuracy below 80 percent would undermine the reported performance claims.
read the original abstract
Large Language Models (LLMs) have been a promising way for automated vulnerability detection. However, most prior studies have explored the use of LLMs to detect vulnerabilities only within single functions, disregarding those related to interprocedural dependencies. These studies overlook vulnerabilities that arise from data and control flows that span multiple functions. Thus, leveraging the context provided by callers and callees may help identify vulnerabilities. This study empirically investigates the effectiveness of detection, the inference cost, and the quality of explanations of four modern LLMs (Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, and Gemini 3 Flash) in detecting vulnerabilities related to interprocedural dependencies. To do that, we conducted an empirical study on 509 vulnerabilities from the ReposVul dataset, systematically varying the level of interprocedural context (target function code-only, target function + callers, and target function + callees) and evaluating the four modern LLMs across C, C++, and Python. The results show that Gemini 3 Flash offers the best cost-effectiveness trade-off for C vulnerabilities, achieving F1 >= 0.978 at an estimated cost of $0.50-$0.58 per configuration, and Claude Haiku 4.5 correctly identified and explained the vulnerability in 93.6% of the evaluated cases. Overall, the findings have direct implications for the design of AI-assisted security analysis tools that can generalize across codebases in multiple programming languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically evaluates four modern LLMs (Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, Gemini 3 Flash) for interprocedural vulnerability detection across C, C++, and Python. It uses 509 vulnerabilities from ReposVul, varies context levels (target function only, plus callers, plus callees), and reports F1 scores, inference costs, and explanation quality, claiming Gemini 3 Flash provides the best cost-effectiveness for C (F1 >= 0.978 at $0.50-$0.58 per configuration) and Claude Haiku 4.5 achieves 93.6% correct identification and explanation.
Significance. If the evaluation methodology is strengthened, the work provides actionable insights into LLM selection for multi-language, context-aware vulnerability detection tools, including cost trade-offs that could inform practical AI-assisted security analysis.
major comments (3)
- The headline result that Claude Haiku 4.5 'correctly identified and explained the vulnerability in 93.6% of the evaluated cases' is load-bearing for the utility claims, yet the manuscript provides no rubric, scoring procedure, inter-rater reliability, blinded review, or independent validation for judging free-text explanations against ReposVul ground truth.
- Reported F1 scores such as >= 0.978 for Gemini 3 Flash include no error bars, confidence intervals, statistical tests, or comparisons against baselines (e.g., single-function prompting, traditional static analyzers, or simpler ML models), making it impossible to assess whether interprocedural context yields statistically meaningful gains.
- The interprocedural context extraction (caller/callee snippets) is presented without validation that the method produces realistic static-analysis-like output or avoids information leakage that would not occur in an actual tool pipeline; this directly affects the generalizability claim.
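For the missing statistical tests, one natural choice for comparing two models (or two context levels) on the same 509 cases is McNemar's test on paired per-case correctness. A minimal self-contained sketch, with illustrative disagreement counts that are not from the paper:

```python
import math

def mcnemar_chi2(b: int, c: int) -> tuple[float, float]:
    """McNemar's test with continuity correction for paired binary outcomes.
    b = cases condition A got right and condition B got wrong; c = the reverse.
    Returns (chi-square statistic with 1 df, two-sided p-value)."""
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Illustrative: 18 cases only model A solved, 6 only model B solved.
chi2, p = mcnemar_chi2(b=18, c=6)
```

Because the test uses only the discordant pairs, it directly answers whether adding caller or callee context changes outcomes more often than chance on the shared case set.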
minor comments (2)
- Abstract and results sections should report the exact distribution of the 509 vulnerabilities across languages and context configurations to allow readers to interpret per-language performance.
- Clarify model naming consistency (e.g., GPT-4.1 Mini, GPT-5 Mini) and provide the precise prompting templates used for each context level.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the evaluation methodology can be strengthened in several respects and will revise the manuscript accordingly. Below we respond point by point to the major comments.
read point-by-point responses
Referee: The headline result that Claude Haiku 4.5 'correctly identified and explained the vulnerability in 93.6% of the evaluated cases' is load-bearing for the utility claims, yet the manuscript provides no rubric, scoring procedure, inter-rater reliability, blinded review, or independent validation for judging free-text explanations against ReposVul ground truth.
Authors: We acknowledge that the current manuscript does not describe the evaluation procedure for explanation quality in sufficient detail. The 93.6% figure was obtained by manually comparing each LLM-generated explanation against the vulnerability description and location provided in the ReposVul ground truth, marking a case as correct only when the explanation identified the vulnerable code pattern and its root cause. We agree this process should be formalized. In the revision we will add a dedicated subsection that (1) presents the explicit rubric with criteria and examples, (2) reports how many cases were double-annotated and the resulting inter-rater agreement, and (3) discusses the absence of blinding as a limitation. We will also make the annotated subset available as supplementary material.
Revision planned: yes
Referee: Reported F1 scores such as >= 0.978 for Gemini 3 Flash include no error bars, confidence intervals, statistical tests, or comparisons against baselines (e.g., single-function prompting, traditional static analyzers, or simpler ML models), making it impossible to assess whether interprocedural context yields statistically meaningful gains.
Authors: We agree that the absence of uncertainty estimates and baseline comparisons weakens the interpretability of the F1 results. The reported F1 scores were computed directly on the full set of 509 vulnerabilities without resampling. In the revision we will (1) add bootstrap confidence intervals (or binomial confidence intervals) for all F1 scores, (2) include a direct comparison against a single-function-only prompting baseline using the same LLMs, and (3) add a brief discussion of why a head-to-head comparison with traditional static analyzers was outside the scope of this LLM-centric study while referencing relevant prior benchmarks. These additions will allow readers to evaluate the incremental benefit of interprocedural context.
Revision planned: yes
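The promised bootstrap confidence intervals can be sketched in a few lines: resample the labeled cases with replacement, recompute F1 on each resample, and take percentile bounds. Function names and the toy labels below are hypothetical.

```python
import random

def f1(y_true: list[int], y_pred: list[int]) -> float:
    """F1 for binary labels (1 = vulnerable)."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement,
    recompute F1 each time, take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    return scores[int(n_boot * alpha / 2)], scores[int(n_boot * (1 - alpha / 2)) - 1]
```

Applied to the 509-case set, this would turn a point estimate like F1 ≥ 0.978 into an interval readers can compare across models and context levels.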
Referee: The interprocedural context extraction (caller/callee snippets) is presented without validation that the method produces realistic static-analysis-like output or avoids information leakage that would not occur in an actual tool pipeline; this directly affects the generalizability claim.
Authors: We recognize that the extraction procedure was described at a high level without explicit validation. Callers and callees were obtained by parsing the repository with tree-sitter and selecting the nearest enclosing functions; snippets were truncated to a fixed token budget to simulate realistic context limits. We did not, however, compare the extracted snippets against outputs from production static-analysis tools or test for leakage of future information. In the revision we will add a validation subsection that (1) provides concrete examples of extracted caller/callee snippets, (2) reports a manual audit of 50 random cases confirming absence of post-target code, and (3) discusses remaining threats to generalizability when the extraction is performed by an actual static analyzer rather than our offline script.
Revision planned: yes
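The rebuttal names tree-sitter as the parser; for the Python slice of the dataset the same name-based caller/callee extraction can be approximated with the standard ast module. The sketch below is illustrative (toy source, top-level functions, simple name matching only); a real pipeline would also resolve methods, imports, and aliases.

```python
import ast

def callers_and_callees(source: str, target: str):
    """Return (callers, callees) of `target` among top-level functions:
    callers are functions whose bodies call `target`; callees are
    functions that `target` calls. Name-based matching only."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}

    def called_names(fn: ast.FunctionDef) -> set[str]:
        return {node.func.id for node in ast.walk(fn)
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}

    callers = [name for name, fn in funcs.items()
               if name != target and target in called_names(fn)]
    callees = [name for name in called_names(funcs[target]) if name in funcs]
    return callers, callees

src = """
def sanitize(s): return s.replace("'", "")
def query(db, s): return db.run(sanitize(s))
def handler(db, req): return query(db, req)
"""
print(callers_and_callees(src, "query"))  # callers=['handler'], callees=['sanitize']
```

The audit the authors promise (checking that no post-target code leaks into the snippets) would operate on exactly this kind of extracted caller/callee set.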
Circularity Check
No circularity: purely empirical evaluation on external dataset
full rationale
The paper reports an empirical study that selects 509 vulnerabilities from the external ReposVul dataset, extracts caller/callee context snippets, prompts four LLMs under three context conditions, and measures F1 scores plus explanation quality via direct comparison to ground-truth labels. No equations, fitted parameters, derivations, or self-citation chains appear in the reported methodology or results. All performance numbers are computed from fresh LLM outputs on the fixed dataset; none reduce to prior results by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The ReposVul dataset contains vulnerabilities whose detection requires interprocedural context from callers and callees.