Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs
Pith reviewed 2026-05-10 17:10 UTC · model grok-4.3
The pith
Modern LLMs detect interprocedural vulnerabilities in C, C++, and Python when given caller and callee code, with strong accuracy at low cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By systematically varying the amount of interprocedural context supplied to the models and measuring performance on the ReposVul dataset, the study finds that Gemini 3 Flash delivers the best cost-effectiveness trade-off for C vulnerabilities, reaching an F1 of at least 0.978 at an estimated cost of $0.50 to $0.58 per configuration. Claude Haiku 4.5, in turn, correctly identifies and explains the vulnerability in 93.6 percent of evaluated cases across the three languages.
What carries the argument
Empirical comparison of four LLMs across three graded levels of interprocedural context (target function only, plus callers, plus callees) on 509 labeled vulnerabilities, tracking F1 score, inference cost, and explanation correctness.
If this is right
- Adding callee code to prompts measurably improves detection rates for several models on C vulnerabilities.
- Low-cost models such as Gemini 3 Flash can be used repeatedly inside automated security pipelines without large per-file expenses.
- Explanation quality is model-dependent, with Claude Haiku 4.5 producing the highest share of accurate written justifications.
- The same prompting approach works across C, C++, and Python, supporting multi-language security tools.
- Performance remains high even when only one side of the call relationship is supplied, reducing the need for full program analysis.
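A back-of-envelope check makes the "under one dollar per configuration" claim concrete. The token counts and per-million-token prices below are illustrative assumptions, not the paper's measured values.

```python
# Per-configuration cost = cases x (input tokens x input price
#                                   + output tokens x output price).
# All numbers below are assumed for illustration.

def config_cost(n_cases: int, in_tokens: int, out_tokens: int,
                in_price_per_m: float, out_price_per_m: float) -> float:
    """Total USD cost of one (model, context-level) configuration."""
    per_case = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6
    return n_cases * per_case

# e.g. 509 cases, ~2,500 input and ~400 output tokens per case,
# at assumed prices of $0.30 / $1.20 per million tokens:
cost = config_cost(509, 2500, 400, 0.30, 1.20)
print(f"${cost:.2f}")  # → $0.63
```

Under these assumed inputs the total lands in the same sub-dollar ballpark as the reported $0.50 to $0.58 range for Gemini 3 Flash.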
Where Pith is reading between the lines
- Security tools could automatically select which context level to include based on a quick first pass, balancing accuracy against token cost.
- The approach might extend to detecting other defect classes such as concurrency bugs or resource leaks that also cross function boundaries.
- Routine whole-repository scans become feasible if per-configuration costs stay under one dollar, potentially catching issues earlier in development.
- Without independent human review of explanations, some high F1 scores may mask cases where models give plausible but incorrect reasoning.
Load-bearing premise
The 509 selected vulnerabilities and the way caller and callee snippets are extracted and shown to the models represent the practical difficulties of interprocedural detection, and the models' explanations do not require separate human checking for correctness.
What would settle it
Re-running the four LLMs on a fresh collection of interprocedural vulnerabilities from additional open-source projects and obtaining F1 scores below 0.90 or explanation accuracy below 80 percent would undermine the reported performance claims.
read the original abstract
Large Language Models (LLMs) have been a promising way for automated vulnerability detection. However, most prior studies have explored the use of LLMs to detect vulnerabilities only within single functions, disregarding those related to interprocedural dependencies. These studies overlook vulnerabilities that arise from data and control flows that span multiple functions. Thus, leveraging the context provided by callers and callees may help identify vulnerabilities. This study empirically investigates the effectiveness of detection, the inference cost, and the quality of explanations of four modern LLMs (Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, and Gemini 3 Flash) in detecting vulnerabilities related to interprocedural dependencies. To do that, we conducted an empirical study on 509 vulnerabilities from the ReposVul dataset, systematically varying the level of interprocedural context (target function code-only, target function + callers, and target function + callees) and evaluating the four modern LLMs across C, C++, and Python. The results show that Gemini 3 Flash offers the best cost-effectiveness trade-off for C vulnerabilities, achieving F1 >= 0.978 at an estimated cost of $0.50-$0.58 per configuration, and Claude Haiku 4.5 correctly identified and explained the vulnerability in 93.6% of the evaluated cases. Overall, the findings have direct implications for the design of AI-assisted security analysis tools that can generalize across codebases in multiple programming languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically evaluates four modern LLMs (Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, Gemini 3 Flash) for interprocedural vulnerability detection across C, C++, and Python. It uses 509 vulnerabilities from ReposVul, varies context levels (target function only, plus callers, plus callees), and reports F1 scores, inference costs, and explanation quality, claiming Gemini 3 Flash provides the best cost-effectiveness for C (F1 >= 0.978 at $0.50-$0.58 per configuration) and Claude Haiku 4.5 achieves 93.6% correct identification and explanation.
Significance. If the evaluation methodology is strengthened, the work provides actionable insights into LLM selection for multi-language, context-aware vulnerability detection tools, including cost trade-offs that could inform practical AI-assisted security analysis.
major comments (3)
- The headline result that Claude Haiku 4.5 'correctly identified and explained the vulnerability in 93.6% of the evaluated cases' is load-bearing for the utility claims, yet the manuscript provides no rubric, scoring procedure, inter-rater reliability, blinded review, or independent validation for judging free-text explanations against ReposVul ground truth.
- Reported F1 scores such as >= 0.978 for Gemini 3 Flash include no error bars, confidence intervals, statistical tests, or comparisons against baselines (e.g., single-function prompting, traditional static analyzers, or simpler ML models), making it impossible to assess whether interprocedural context yields statistically meaningful gains.
- The interprocedural context extraction (caller/callee snippets) is presented without validation that the method produces realistic static-analysis-like output or avoids information leakage that would not occur in an actual tool pipeline; this directly affects the generalizability claim.
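For the missing statistical tests, one natural choice for comparing two models (or two context levels) on the same 509 cases is McNemar's test on paired per-case correctness. A minimal self-contained sketch, with illustrative disagreement counts that are not from the paper:

```python
import math

def mcnemar_chi2(b: int, c: int) -> tuple[float, float]:
    """McNemar's test with continuity correction for paired binary outcomes.
    b = cases condition A got right and condition B got wrong; c = the reverse.
    Returns (chi-square statistic with 1 df, two-sided p-value)."""
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Illustrative: 18 cases only model A solved, 6 only model B solved.
chi2, p = mcnemar_chi2(b=18, c=6)
```

Because the test uses only the discordant pairs, it directly answers whether adding caller or callee context changes outcomes more often than chance on the shared case set.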
minor comments (2)
- Abstract and results sections should report the exact distribution of the 509 vulnerabilities across languages and context configurations to allow readers to interpret per-language performance.
- Clarify model naming consistency (e.g., GPT-4.1 Mini, GPT-5 Mini) and provide the precise prompting templates used for each context level.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the evaluation methodology can be strengthened in several respects and will revise the manuscript accordingly. Below we respond point by point to the major comments.
read point-by-point responses
Referee: The headline result that Claude Haiku 4.5 'correctly identified and explained the vulnerability in 93.6% of the evaluated cases' is load-bearing for the utility claims, yet the manuscript provides no rubric, scoring procedure, inter-rater reliability, blinded review, or independent validation for judging free-text explanations against ReposVul ground truth.
Authors: We acknowledge that the current manuscript does not describe the evaluation procedure for explanation quality in sufficient detail. The 93.6% figure was obtained by manually comparing each LLM-generated explanation against the vulnerability description and location provided in the ReposVul ground truth, marking a case as correct only when the explanation identified the vulnerable code pattern and its root cause. We agree this process should be formalized. In the revision we will add a dedicated subsection that (1) presents the explicit rubric with criteria and examples, (2) reports how many cases were double-annotated and the resulting inter-rater agreement, and (3) discusses the absence of blinding as a limitation. We will also make the annotated subset available as supplementary material.
Revision planned: yes
Referee: Reported F1 scores such as >= 0.978 for Gemini 3 Flash include no error bars, confidence intervals, statistical tests, or comparisons against baselines (e.g., single-function prompting, traditional static analyzers, or simpler ML models), making it impossible to assess whether interprocedural context yields statistically meaningful gains.
Authors: We agree that the absence of uncertainty estimates and baseline comparisons weakens the interpretability of the F1 results. The reported F1 scores were computed directly on the full set of 509 vulnerabilities without resampling. In the revision we will (1) add bootstrap confidence intervals (or binomial confidence intervals) for all F1 scores, (2) include a direct comparison against a single-function-only prompting baseline using the same LLMs, and (3) add a brief discussion of why a head-to-head comparison with traditional static analyzers was outside the scope of this LLM-centric study while referencing relevant prior benchmarks. These additions will allow readers to evaluate the incremental benefit of interprocedural context.
Revision planned: yes
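The promised bootstrap confidence intervals can be sketched in a few lines: resample the labeled cases with replacement, recompute F1 on each resample, and take percentile bounds. Function names and the toy labels below are hypothetical.

```python
import random

def f1(y_true: list[int], y_pred: list[int]) -> float:
    """F1 for binary labels (1 = vulnerable)."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement,
    recompute F1 each time, take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    return scores[int(n_boot * alpha / 2)], scores[int(n_boot * (1 - alpha / 2)) - 1]
```

Applied to the 509-case set, this would turn a point estimate like F1 ≥ 0.978 into an interval readers can compare across models and context levels.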
Referee: The interprocedural context extraction (caller/callee snippets) is presented without validation that the method produces realistic static-analysis-like output or avoids information leakage that would not occur in an actual tool pipeline; this directly affects the generalizability claim.
Authors: We recognize that the extraction procedure was described at a high level without explicit validation. Callers and callees were obtained by parsing the repository with tree-sitter and selecting the nearest enclosing functions; snippets were truncated to a fixed token budget to simulate realistic context limits. We did not, however, compare the extracted snippets against outputs from production static-analysis tools or test for leakage of future information. In the revision we will add a validation subsection that (1) provides concrete examples of extracted caller/callee snippets, (2) reports a manual audit of 50 random cases confirming absence of post-target code, and (3) discusses remaining threats to generalizability when the extraction is performed by an actual static analyzer rather than our offline script.
Revision planned: yes
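The rebuttal names tree-sitter as the parser; for the Python slice of the dataset the same name-based caller/callee extraction can be approximated with the standard ast module. The sketch below is illustrative (toy source, top-level functions, simple name matching only); a real pipeline would also resolve methods, imports, and aliases.

```python
import ast

def callers_and_callees(source: str, target: str):
    """Return (callers, callees) of `target` among top-level functions:
    callers are functions whose bodies call `target`; callees are
    functions that `target` calls. Name-based matching only."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}

    def called_names(fn: ast.FunctionDef) -> set[str]:
        return {node.func.id for node in ast.walk(fn)
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}

    callers = [name for name, fn in funcs.items()
               if name != target and target in called_names(fn)]
    callees = [name for name in called_names(funcs[target]) if name in funcs]
    return callers, callees

src = """
def sanitize(s): return s.replace("'", "")
def query(db, s): return db.run(sanitize(s))
def handler(db, req): return query(db, req)
"""
print(callers_and_callees(src, "query"))  # callers=['handler'], callees=['sanitize']
```

The audit the authors promise (checking that no post-target code leaks into the snippets) would operate on exactly this kind of extracted caller/callee set.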
Circularity Check
No circularity: purely empirical evaluation on external dataset
full rationale
The paper reports an empirical study that selects 509 vulnerabilities from the external ReposVul dataset, extracts caller/callee context snippets, prompts four LLMs under three context conditions, and measures F1 scores plus explanation quality via direct comparison to ground-truth labels. No equations, fitted parameters, derivations, or self-citation chains appear in the reported methodology or results. All performance numbers are computed from fresh LLM outputs on the fixed dataset; none reduce to prior results by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The ReposVul dataset contains vulnerabilities whose detection requires interprocedural context from callers and callees.