pith. sign in

arxiv: 2605.18153 · v1 · pith:VFDHV4XGnew · submitted 2026-05-18 · 💻 cs.SE

Three Heads Are Better Than One: A Multi-perspective Reasoning Framework for Enhanced Vulnerability Detection

Pith reviewed 2026-05-20 09:13 UTC · model grok-4.3

classification 💻 cs.SE
keywords vulnerability detectionlarge language modelsmulti-agent systemssoftware securitycode analysisreasoning frameworksconflict resolutiondebate mechanisms
0
0 comments X

The pith

Three LLM agents using distinct reasoning modes detect code vulnerabilities more accurately by debating disagreements until resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework in which three large language model agents, each following a separate reasoning style, first review source code independently for potential security issues. These agents then engage in a structured process of challenging one another's conclusions and revising their views through multiple rounds of rebuttal. The goal is to produce a final judgment that accounts for perspectives a single agent would overlook. A sympathetic reader would care because many real-world vulnerabilities arise from complex interactions that isolated analyses tend to miss, and successful collaboration could reduce undetected flaws in deployed software.

Core claim

The central claim is that a multi-perspective reasoning framework built around three specialized LLM agents, combined with an iterative rebuttal-revision debate mechanism, produces superior vulnerability detection by surfacing and correcting individual errors that arise when agents disagree, leading to more reliable identification of security flaws in source code.

What carries the argument

Three LLM agents each embodying a distinct reasoning mode together with the iterative rebuttal-revision debate mechanism that drives conflict resolution and collaborative judgment.

If this is right

  • Vulnerability detection becomes more robust when multiple complementary reasoning styles are forced to reconcile differences rather than relying on any one style alone.
  • The debate mechanism provides a built-in error-correction step that improves results on cases where individual agents reach conflicting conclusions.
  • The same structure can be applied to other code-analysis tasks that benefit from reconciling divergent initial judgments.
  • Automated security tools gain reliability without requiring changes to the underlying language models themselves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Development teams could embed similar multi-agent debate steps inside continuous integration pipelines to flag security risks before code is merged.
  • The approach might generalize to related problems such as detecting logical bugs or performance issues where single analyses often diverge.
  • Further gains could come from testing whether adding a fourth reasoning mode or occasional human oversight strengthens the error-correction effect.

Load-bearing premise

The three reasoning modes are sufficiently different that their debate process corrects errors instead of simply reinforcing shared mistakes or model biases.

What would settle it

Run the framework on a collection of code snippets containing documented vulnerabilities that single-reasoning methods fail to flag; if the multi-agent system also misses them or resolves its internal debates in favor of incorrect outcomes, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2605.18153 by Bo Lin, Jie Yu, Jing Wang, Jun Ma, Shangwen Wang, Xiaoguang Mao, Xiaoling Li, Xin Peng.

Figure 1
Figure 1. Figure 1: The overlap of vulnerabilities cor￾rectly identified by SAVul (Deductive), Vul￾RAG (Inductive), and VulnSage (Abductive) on the PairVul dataset. While recent LLM-based methods have shown consider￾able promise in vulnerability detection, their effectiveness is often constrained by an implicit reliance on a single reasoning paradigm. To investigate this limitation, we conducted an empirical study on the Pair… view at source ↗
Figure 2
Figure 2. Figure 2: Three motivating cases from the Linux kernel, selected from the PairVul dataset, demonstrating the [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The workflow of ReasonVul. This two-stage architecture is a direct response to our findings in Section 3. Relying on a single reasoning paradigm results in critical blind spots, while simplistic aggregation methods like majority voting are insufficient for resolving nuanced disagreements. By contrast, ReasonVul’s debate mechanism facilitates a deeper integration of perspectives, leading to more accurate an… view at source ↗
Figure 4
Figure 4. Figure 4: The impact of debate rounds Introducing debate substantially improves performance: with just 1 round, PairAcc in￾creases to 26.46%, Accuracy to 59.38%, and F1- score to 61.54%. The best results are achieved at 2 rounds, where PairAcc reaches 30.42%, Accu￾racy 62.50%, and F1-score 63.41%, demonstrat￾ing that structured debate effectively promotes cognitive synergy through iterative refinement. Beyond 2 roun… view at source ↗
Figure 5
Figure 5. Figure 5: Case study of the ReasonVul for CVE-2021-41864. structured debate, enables ReasonVul to detect subtle vulnerabilities that might be overlooked by any single reasoning method or by simple aggregation approaches such as majority voting. The condensed transcript of the agent interactions is depicted in [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The overlaps of the vulnerabilities detected by [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
read the original abstract

Automated vulnerability detection is crucial for enhancing software security by identifying potential flaws that attackers could exploit, thereby reducing the reliance on labor-intensive manual code audits. Recent advancements have shifted towards leveraging large language models (LLMs) for vulnerability detection, with techniques like Vul-RAG and VulnSage demonstrating progress through structured prompting and external knowledge integration. However, these approaches typically rely on a single reasoning paradigm, limiting their ability to address the complex and diverse nature of real-world vulnerabilities. To overcome these limitations, we propose ReasonVul, a novel multi-perspective reasoning framework that harnesses cognitive synergy among three specialized LLM agents, each embodying a distinct reasoning mode. The framework begins with independent analyses of the source code, followed by a structured debate mechanism to resolve conflicts through iterative rebuttal and revision, ultimately converging on a collaborative judgment. Evaluated on the PrimeVul dataset, ReasonVul achieves a PairAcc of 40.00% and an F1-score of 72.52%, surpassing the best baseline by 81.24% in PairAcc. Further tests on the JITVUL dataset confirm its generalizability, with a PairAcc of 28.67%. Additionally, we analyzed 542 conflict cases and found that 389 were correctly resolved, highlighting the framework's ability to uncover hidden vulnerabilities through the error-correction mechanism driven by the debate. This work emphasizes the importance of multi-perspective reasoning and collaborative validation in achieving robust and comprehensive vulnerability detection in real-world software systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes ReasonVul, a multi-perspective reasoning framework for enhanced vulnerability detection using three specialized LLM agents that perform independent analyses and engage in an iterative debate with rebuttal and revision to resolve conflicts. It reports achieving a PairAcc of 40.00% and F1-score of 72.52% on the PrimeVul dataset, surpassing the best baseline by 81.24%, with generalizability shown on JITVUL at 28.67% PairAcc and correct resolution of 389 out of 542 conflict cases.

Significance. If the experimental claims hold after verification, this work would demonstrate the value of structured multi-agent debate for improving LLM reliability on vulnerability detection tasks, potentially advancing automated security analysis beyond single-paradigm prompting approaches.

major comments (2)
  1. [Abstract] Abstract: the central performance claims (40.00% PairAcc and 81.24% relative gain on PrimeVul) are load-bearing, yet the manuscript provides no description of baseline implementations, exact data splits, or statistical significance tests, preventing assessment of whether the reported lift is robust.
  2. [Evaluation] Evaluation section: the claim that the debate mechanism correctly resolves 389/542 conflicts and drives error correction rests on the assumption of complementary reasoning modes, but no ablation (e.g., replacing iterative rebuttal-revision with majority vote or single extended chain) or control for total LLM calls/tokens is reported, leaving open alternative explanations for the observed gains.
minor comments (1)
  1. [Abstract] Abstract: define PairAcc explicitly and clarify how 'correct resolution' of conflicts is determined against ground truth.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the experimental reporting and analysis.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central performance claims (40.00% PairAcc and 81.24% relative gain on PrimeVul) are load-bearing, yet the manuscript provides no description of baseline implementations, exact data splits, or statistical significance tests, preventing assessment of whether the reported lift is robust.

    Authors: We agree that the abstract and main text would benefit from greater transparency on these points. In the revised manuscript we will add concise descriptions of the baseline implementations, the precise train/test splits on PrimeVul and JITVUL, and the results of statistical significance tests (e.g., McNemar’s test) for the reported PairAcc and F1 improvements. revision: yes

  2. Referee: [Evaluation] Evaluation section: the claim that the debate mechanism correctly resolves 389/542 conflicts and drives error correction rests on the assumption of complementary reasoning modes, but no ablation (e.g., replacing iterative rebuttal-revision with majority vote or single extended chain) or control for total LLM calls/tokens is reported, leaving open alternative explanations for the observed gains.

    Authors: This concern is well-founded. We will include new ablation studies that replace the iterative rebuttal-revision process with (i) a majority-vote aggregator and (ii) a single extended chain-of-thought prompt, while explicitly controlling for the total number of LLM calls and tokens. These experiments will be reported alongside the existing 389/542 conflict-resolution analysis to isolate the contribution of the structured debate. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical framework evaluation on held-out data

full rationale

The paper introduces ReasonVul as a multi-agent LLM framework with independent analyses followed by rebuttal-revision debate, then reports measured performance (PairAcc 40.00%, F1 72.52% on PrimeVul; 28.67% on JITVUL) and counts of resolved conflicts (389/542) directly from experiments on external datasets. No equations, fitted parameters, or first-principles derivations exist that could reduce to self-defined inputs. Claims rest on observed outcomes rather than quantities constructed by definition or self-citation chains. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical performance of LLM agents whose internal reasoning capabilities are treated as black-box oracles; no new mathematical axioms or invented physical entities are introduced.

axioms (2)
  • domain assumption LLM agents can embody distinct and complementary reasoning modes when given specialized prompts
    Invoked in the description of the three specialized agents and the expectation that their independent analyses will produce useful diversity.
  • domain assumption Structured debate via iterative rebuttal and revision will converge on more accurate judgments than independent analysis
    Central to the framework's error-correction mechanism described in the abstract.

pith-pipeline@v0.9.0 · 5819 in / 1428 out tokens · 36403 ms · 2026-05-20T09:13:12.302000+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 2 internal anchors

  1. [1]

    Andrei Arusoaie, Stefan Ciobâca, Vlad Craciun, Dragos Gavrilut, and Dorel Lucanu. 2017. A Comparison of Open- Source Static Analysis Tools for Vulnerability Detection in C/C++ Code. In2017 19th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC). 161–168. doi:10.1109/SYNASC.2017.00035

  2. [2]

    Canan Batur Şahin and Laith Abualigah. 2021. A novel deep learning-based feature selection model for improving the static analysis of vulnerability detection.Neural Computing and Applications33, 20 (2021), 14049–14067

  3. [3]

    Sicong Cao, Xiaobing Sun, Lili Bo, Ying Wei, and Bin Li. 2021. Bgnn4vd: Constructing bidirectional graph neural- network for vulnerability detection.Information and Software Technology136 (2021), 106576. , Vol. 1, No. 1, Article . Publication date: May 2026. ReasonVul: A Multi-perspective Reasoning Framework for Enhanced Vulnerability Detection 21

  4. [4]

    Sicong Cao, Xiaobing Sun, Lili Bo, Rongxin Wu, Bin Li, and Chuanqi Tao. 2022. MVD: memory-related vulnerability detection based on flow-sensitive graph neural networks. InProceedings of the 44th International Conference on Software Engineering(Pittsburgh, Pennsylvania)(ICSE ’22). Association for Computing Machinery, New York, NY, USA, 1456–1468. doi:10.11...

  5. [5]

    Sicong Cao, Xiaobing Sun, Xiaoxue Wu, David Lo, Lili Bo, Bin Li, and Wei Liu. 2024. Coca: Improving and explaining graph neural network-based vulnerability detection systems. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

  6. [6]

    Ira Ceka, Feitong Qiao, Anik Dey, Aastha Valecha, Gail Kaiser, and Baishakhi Ray. 2024. Can llm prompting serve as a proxy for static analysis in vulnerability detection.arXiv preprint arXiv:2412.12039(2024)

  7. [7]

    Xiao Cheng, Haoyu Wang, Jiayi Hua, Guoai Xu, and Yulei Sui. 2021. DeepWukong: Statically Detecting Software Vulnerabilities Using Deep Graph Neural Network.ACM Trans. Softw. Eng. Methodol.30, 3, Article 38 (April 2021), 33 pages. doi:10.1145/3436877

  8. [8]

    Seyed Shayan Daneshvar, Yu Nong, Xu Yang, Shaowei Wang, and Haipeng Cai. 2025. VulScribeR: Exploring RAG-based Vulnerability Augmentation with LLMs.ACM Transactions on Software Engineering and Methodology(2025)

  9. [9]

    Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2024. Vulnerability Detection with Code Language Models: How Far Are We?. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 469–481

  10. [10]

    Xueying Du, Geng Zheng, Kaixin Wang, Yi Zou, Yujia Wang, Wentai Deng, Jiayi Feng, Mingwei Liu, Bihuan Chen, Xin Peng, et al. 2024. Vul-rag: Enhancing llm-based vulnerability detection via knowledge-level rag.arXiv preprint arXiv:2406.11147(2024)

  11. [11]

    Andrew Estornell and Yang Liu. 2024. Multi-LLM Debate: Framework, Principals, and Interventions. 37 (2024), 28938–28964. doi:10.52202/079017-0911

  12. [12]

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al . 2020. Codebert: A pre-trained model for programming and natural languages.arXiv preprint arXiv:2002.08155(2020)

  13. [13]

    Michael Fu and Chakkrit Tantithamthavorn. 2022. Linevul: A transformer-based line-level vulnerability prediction. In Proceedings of the 19th International Conference on Mining Software Repositories. 608–620

  14. [14]

    Michael Fu, Chakkrit Kla Tantithamthavorn, Van Nguyen, and Trung Le. 2023. Chatgpt for vulnerability detection, classification, and repair: How far are we?. In2023 30th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 632–636

  15. [15]

    Zeyu Gao, Hao Wang, Yuchen Zhou, Wenyu Zhu, and Chao Zhang. 2023. How far have we gone in vulnerability detection using large language models.arXiv preprint arXiv:2311.12420(2023)

  16. [16]

    GNU. 2025. Gnu cflow: analyzing a collection of c source files, charting control flow within the program. (2025)

  17. [17]

    Nima Shiri Harzevili, Alvine Boaye Belle, Junjie Wang, Song Wang, Zhen Ming, Nachiappan Nagappan, et al. 2023. A survey on automated software vulnerability detection using machine learning and deep learning.arXiv preprint arXiv:2306.11673(2023)

  18. [18]

    Junda He, Christoph Treude, and David Lo. 2025. LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead.ACM Trans. Softw. Eng. Methodol.34, 5, Article 124 (May 2025), 30 pages. doi:10.1145/3712003

  19. [19]

    Kaiyu He, Mian Zhang, Shuo Yan, Peilin Wu, and Zhiyu Zoey Chen. 2024. Idea: Enhancing the rule learning ability of large language model agent through induction, deduction, and abduction.arXiv preprint arXiv:2408.10455(2024)

  20. [20]

    Sihao Hu, Tiansheng Huang, Fatih İlhan, Selim Furkan Tekin, and Ling Liu. 2023. Large language model-powered smart contract vulnerability detection: New perspectives. In2023 5th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA). IEEE, 297–306

  21. [21]

    Larry Huynh, Yinghao Zhang, Djimon Jayasundera, Woojin Jeon, Hyoungshick Kim, Tingting Bi, and Jin B. Hong

  22. [22]

    Faster Hash-based Multi-valued Validated Asynchronous Byzantine Agreement

    Detecting Code Vulnerabilities using LLMs. In2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 401–414. doi:10.1109/DSN64029.2025.00047

  23. [23]

    Ridhi Jain, Nicole Gervasoni, Mthandazo Ndhlovu, and Sanjay Rawat. 2023. A code centric evaluation of c/c++ vulnerability datasets for deep learning based vulnerability detection techniques. InProceedings of the 16th Innovations in Software Engineering Conference. 1–10

  24. [24]

    Yuan Jiang, Yujian Zhang, Xiaohong Su, Christoph Treude, and Tiantian Wang. 2024. Stagedvulbert: Multi-granular vulnerability detection with a novel pre-trained code model.IEEE Transactions on Software Engineering(2024)

  25. [25]

    Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. 2023. Loogle: Can long-context language models understand long contexts?arXiv preprint arXiv:2311.04939(2023)

  26. [26]

    Yue Li, Xiao Li, Hao Wu, Minghui Xu, Yue Zhang, Xiuzhen Cheng, Fengyuan Xu, and Sheng Zhong. 2025. Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask.arXiv preprint arXiv:2504.13474 (2025). , Vol. 1, No. 1, Article . Publication date: May 2026. 22 X. Peng, B. Lin, j. Wang, X. Li, J. Ma, J. Yu, X. Mao, S. Wang

  27. [27]

    Ziyang Li, Saikat Dutta, and Mayur Naik. 2024. IRIS: LLM-assisted static analysis for detecting security vulnerabilities. arXiv preprint arXiv:2405.17238(2024)

  28. [28]

    Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. 2018. Vuldeepecker: A deep learning-based system for vulnerability detection.arXiv preprint arXiv:1801.01681(2018)

  29. [29]

    RongTao Liao, XueHu Yan, and KaiLong Zhu. 2025. KLRAG: Deep Learning Library Vulnerability Detection via Knowledge-Level RAG. InInternational Conference on Intelligent Computing. Springer, 443–456

  30. [30]

    Bo Lin, Shangwen Wang, Yihao Qin, Liqian Chen, and Xiaoguang Mao. 2025. Give LLMs a Security Course: Securing Retrieval-Augmented Code Generation via Knowledge Injection. Association for Computing Machinery, New York, NY, USA. doi:10.1145/3719027.3765049

  31. [31]

    Guanjun Lin, Sheng Wen, Qing-Long Han, Jun Zhang, and Yang Xiang. 2020. Software Vulnerability Detection Using Deep Neural Networks: A Survey.Proc. IEEE108, 10 (2020), 1825–1848. doi:10.1109/JPROC.2020.2993293

  32. [32]

    Stephan Lipp, Sebastian Banescu, and Alexander Pretschner. 2022. An empirical study on the effectiveness of static C code analyzers for vulnerability detection. InProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis(Virtual, South Korea)(ISSTA 2022). Association for Computing Machinery, New York, NY, USA, 544–555. d...

  33. [33]

    Yuhan Liu, Yuxuan Liu, Xiaoqing Zhang, Xiuying Chen, and Rui Yan. 2025. The Truth Becomes Clearer Through Debate! Multi-Agent Systems with Large Language Models Unmask Fake News. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval(Padua, Italy)(SIGIR ’25). Association for Computing Machinery, N...

  34. [34]

    Yisroel Mirsky, George Macon, Michael Brown, Carter Yagemann, Matthew Pruett, Evan Downing, Sukarno Mertoguno, and Wenke Lee. 2023. VulChecker: Graph-based Vulnerability Localization in Source Code. In32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, 6557–6574

  35. [35]

    Van Nguyen, Trung Le, Chakkrit Tantithamthavorn, John Grundy, and Dinh Phung. 2024. Deep domain adaptation with max-margin principle for cross-project imbalanced software vulnerability detection.ACM Transactions on Software Engineering and Methodology33, 6 (2024), 1–34

  36. [36]

    Yu Nong, Mohammed Aldeen, Long Cheng, Hongxin Hu, Feng Chen, and Haipeng Cai. 2024. Chain-of-thought prompting of large language models for discovering and fixing software vulnerabilities.arXiv preprint arXiv:2402.17230 (2024)

  37. [37]

    Chitu Okoli. 2023. Inductive, abductive and deductive theorising.International Journal of Management Concepts and Philosophy16, 3 (2023), 302–316. doi:10.1504/IJMCP.2023.131769

  38. [38]

    Xin Peng, Jieren Cheng, Xiangyan Tang, Jingxin Liu, and Jiahua Wu. 2023. Dual contrastive learning network for graph clustering.IEEE Transactions on Neural Networks and Learning Systems35, 8 (2023), 10846–10856

  39. [39]

    Xin Peng, Jieren Cheng, Xiangyan Tang, Bin Zhang, and Wenxuan Tu. 2024. Multi-view graph imputation network. Information Fusion102 (2024), 102024

  40. [40]

    Xin Peng, Shangwen Wang, Yihao Qin, Bo Lin, Liqian Chen, Jieren Cheng, and Xiaoguang Mao. 2025. Keep It Simple: Self-Adaptive Code Graph Simplification for Accurate Vulnerability Detection.IEEE Transactions on Software Engineering51, 10 (2025), 2744–2763. doi:10.1109/TSE.2025.3593515

  41. [41]

    Md Mahbubur Rahman, Ira Ceka, Chengzhi Mao, Saikat Chakraborty, Baishakhi Ray, and Wei Le. 2024. Towards Causal Deep Learning for Vulnerability Detection. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. Association for Computing Machinery. doi:10.1145/3597503.3639170

  42. [42]

    Shahriyar Zaman Ridoy, Md Shazzad Hossain Shaon, Alfredo Cuzzocrea, and Mst Shapna Akter. 2024. EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code. In 2024 IEEE International Conference on Big Data (BigData). IEEE, 6356–6364

  43. [43]

    Niklas Risse, Jing Liu, and Marcel Böhme. 2025. Top score on the wrong exam: On benchmarking in machine learning for vulnerability detection.Proceedings of the ACM on Software Engineering2, ISSTA (2025), 388–410

  44. [44]

    Robert C Seacord. 2008. The CERT C secure coding standard.Pearson Education(2008)

  45. [45]

    Miaomiao Shao and Yuxin Ding. 2024. FVD-DPM: Fine-grained Vulnerability Detection via Conditional Diffusion Probabilistic Models. In33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, 7375–7392

  46. [46]

    Yu Sheng, Wanting Wen, Linjing Li, and Daniel Zeng. 2025. Evaluating Generalization Capability of Language Models across Abductive, Deductive and Inductive Logical Reasoning. InProceedings of the 31st International Conference on Computational Linguistics. 4945–4957

  47. [47]

    Ze Sheng, Zhicheng Chen, Shuning Gu, Heqing Huang, Guofei Gu, and Jeff Huang. 2025. LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights.arXiv preprint arXiv:2502.07049(2025)

  48. [48]

    Samiha Shimmi, Ashiqur Rahman, Mohan Gadde, Hamed Okhravi, and Mona Rahimi. 2024. VulSim: Leveraging Similarity of Multi-Dimensional Neighbor Embeddings for Vulnerability Detection. In33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, 1777–1794. , Vol. 1, No. 1, Article . Publication date: May 2026. ReasonVul: A Multi-perspective Rea...

  49. [49]

    Benjamin Steenhoek, Hongyang Gao, and Wei Le. 2024. Dataflow Analysis-Inspired Deep Learning for Efficient Vul- nerability Detection. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. Association for Computing Machinery. doi:10.1145/3597503.3623345

  50. [50]

    Benjamin Steenhoek, Md Mahbubur Rahman, Richard Jiles, and Wei Le. 2023. An Empirical Study of Deep Learning Models for Vulnerability Detection. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 2237–2248. doi:10.1109/ICSE48619.2023.00188

  51. [51]

    Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Hengbo Tong, Swarna Das, Earl T Barr, and Wei Le. 2024. To err is machine: Vulnerability detection challenges llm reasoning.arXiv preprint arXiv:2403.17218(2024)

  52. [52]

    Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, et al. 2024. jina-embeddings-v3: Multilingual embeddings with task lora.arXiv preprint arXiv:2409.10173(2024)

  53. [53]

    Shangwen Wang, Bo Lin, Liqian Chen, and Xiaoguang Mao. 2025. Divide-and-Conquer: Automating Code Revisions via Localization-and-Revision. 34, 3 (2025). doi:10.1145/3697013

  54. [54]

    Xinchen Wang, Ruida Hu, Cuiyun Gao, Xin-Cheng Wen, Yujia Chen, and Qing Liao. 2024. Reposvul: A repository-level high-quality vulnerability dataset. InProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. 472–483

  55. [55]

    Zhiyuan Wei, Jing Sun, Yuqiang Sun, Ye Liu, Daoyuan Wu, Zijian Zhang, Xianhao Zhang, Meng Li, Yang Liu, Chunmiao Li, et al . 2025. Advanced Smart Contract Vulnerability Detection via LLM-Powered Multi-Agent Systems.IEEE Transactions on Software Engineering(2025)

  56. [56]

    Xin-Cheng Wen, Yupan Chen, Cuiyun Gao, Hongyu Zhang, Jie M Zhang, and Qing Liao. 2023. Vulnerability detection with graph simplification and enhanced graph representation learning. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2275–2286

  57. [57]

    Xin-Cheng Wen, Xinchen Wang, Yujia Chen, Ruida Hu, David Lo, and Cuiyun Gao. 2024. Vuleval: Towards repository- level evaluation of software vulnerability detection.arXiv preprint arXiv:2404.15596(2024)

  58. [58]

    Xin-Cheng Wen, Jiaxin Ye, Cuiyun Gao, Lianwei Wu, and Qing Liao. 2024. EvalSVA: Multi-Agent Evaluators for Next-Gen Software Vulnerability Assessment.arXiv preprint arXiv:2501.14737(2024)

  59. [59]

    Ratnadira Widyasari, Martin Weyssow, Ivana Clairine Irsan, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, Hong Jin Kang, and David Lo. 2025. Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents.arXiv preprint arXiv:2505.10961(2025)

  60. [60]

    Yueming Wu, Deqing Zou, Shihan Dou, Wei Yang, Duo Xu, and Hai Jin. 2022. Vulcnn: An image-inspired scalable vulnerability detection system. InProceedings of the 44th International Conference on Software Engineering. 2365–2376

  61. [61]

    Yuying Xia, Haijian Shao, and Xing Deng. 2024. Vulcobert: A codebert-based system for source code vulnerability detection. InProceedings of the 2024 international conference on generative artificial intelligence and information security. 249–252

  62. [62]

    Xu Yang, Shaowei Wang, Yi Li, and Shaohua Wang. 2023. Does data sampling improve deep learning-based vulnerability detection? yeas! and nays!. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2287–2298

  63. [63]

    Xu Yang, Shaowei Wang, Jiayuan Zhou, and Wenhan Zhu. 2025. One-for-All Does Not Work! Enhancing Vulnerability Detection by Mixture-of-Experts (MoE).Proceedings of the ACM on Software Engineering2, FSE (2025), 446–464

  64. [64]

    Alperen Yildiz, Sin G Teo, Yiling Lou, Yebo Feng, Chong Wang, and Dinil M Divakaran. 2025. Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories.arXiv preprint arXiv:2503.03586(2025)

  65. [65]

    Xin Yin, Chao Ni, and Shaohua Wang. 2024. Multitask-based evaluation of open-source llm on software vulnerability. IEEE Transactions on Software Engineering(2024)

  66. [66]

    Jeffy Yu. 2024. Retrieval Augmented Generation Integrated Large Language Models in Smart Contract Vulnerability Detection.arXiv preprint arXiv:2407.14838(2024)

  67. [67]

    Chenyuan Zhang, Hao Liu, Jiutian Zeng, Kejing Yang, Yuhong Li, and Hui Li. 2024. Prompt-enhanced software vulnerability detection using chatgpt. InProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. 276–277

  68. [68]

    Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Xudong Liu, Chunming Hu, and Yang Liu. 2023. Detecting condition-related bugs with control flow graph neural network. InProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1370–1382

  69. [69]

    Tiehua Zhang, Rui Xu, Jianping Zhang, Yuze Liu, Xin Chen, Jun Yin, and Xi Zheng. 2024. DSHGT: Dual-Supervisors Heterogeneous Graph Transformer—A Pioneer Study of Using Heterogeneous Graph Learning for Detecting Software Vulnerabilities.ACM Transactions on Software Engineering and Methodology33, 8 (2024), 1–31

  70. [70]

    Yu Zhao, Lina Gong, Zhiqiu Huang, Yongwei Wang, Mingqiang Wei, and Fei Wu. 2024. Coding-ptms: How to find optimal code pre-trained models for code embedding in vulnerability detection?. InProceedings of the 39th IEEE/ACM , Vol. 1, No. 1, Article . Publication date: May 2026. 24 X. Peng, B. Lin, j. Wang, X. Li, J. Ma, J. Yu, X. Mao, S. Wang International C...

  71. [71]

    Xin Zhou, Sicong Cao, Xiaobing Sun, and David Lo. 2025. Large language model for vulnerability detection and repair: Literature review and the road ahead.ACM Transactions on Software Engineering and Methodology34, 5 (2025), 1–31

  72. [72]

    Xin Zhou, Ting Zhang, and David Lo. 2024. Large language model for vulnerability detection: Emerging results and future directions. InProceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results. 47–51

  73. [73]

    Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks.Advances in neural information processing systems32 (2019)

  74. [74]

    Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Weiguo Chen, Zixuan Dong, Mingyue Leng, and Xiaoguang Mao

  75. [75]

    ACM Softw

    MiSum: Multi-modality Heterogeneous Code Graph Learning for Multi-intent Binary Code Summarization.Proc. ACM Softw. Eng.2, FSE (2025). doi:10.1145/3715780

  76. [76]

    Yuan Zhuang, Zhenguang Liu, Peng Qian, Qi Liu, Xiang Wang, and Qinming He. 2021. Smart contract vulnerability detection using graph neural networks. InProceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence(Yokohama, Yokohama, Japan)(IJCAI’20). Article 454, 8 pages

  77. [77]

    Arastoo Zibaeirad and Marco Vieira. 2024. Vulnllmeval: A framework for evaluating large language models in software vulnerability detection and patching.arXiv preprint arXiv:2409.10756(2024)

  78. [78]

    Arastoo Zibaeirad and Marco Vieira. 2025. Reasoning with llms for zero-shot vulnerability detection.arXiv preprint arXiv:2503.17885(2025). , Vol. 1, No. 1, Article . Publication date: May 2026