pith. sign in

arxiv: 2605.21779 · v1 · pith:7W66LXMPnew · submitted 2026-05-20 · 💻 cs.CR · cs.SE

FuzzingBrain V2: A Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction

Pith reviewed 2026-05-22 08:39 UTC · model grok-4.3

classification 💻 cs.CR cs.SE
keywords vulnerability detectionmulti-agent systemslarge language modelsfuzzingzero-day vulnerabilitiessoftware securitycontrol flow analysisautomated testing
0
0 comments X

The pith

FuzzingBrain V2 uses a multi-agent LLM setup with a Suspicious Point abstraction to automatically discover reproducible software vulnerabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FuzzingBrain V2, a multi-agent system built on large language models to find and verify software vulnerabilities automatically. It addresses high false positive rates, poor localization granularity, and challenges with cross-function dependencies by integrating with OSS-Fuzz for reproducibility, using a new control-flow abstraction called Suspicious Point, and employing hierarchical analysis with specialized tools. A reader would care because this promises to make finding security bugs faster and more reliable, leading to fewer vulnerabilities in widely used software.

Core claim

FuzzingBrain V2 is a multi-agent LLM system for automated vulnerability discovery that ensures all reports are fuzzer-reproducible through integration with Google's OSS-Fuzz. The system introduces Suspicious Point, a control-flow-based abstraction for vulnerability localization at the optimal granularity, along with logic-driven hierarchical function analysis using dual-layer fuzzing and MCP-based static and dynamic analysis tools. On the AIxCC 2025 dataset it detects 90 percent of vulnerabilities, and in real-world use it found 29 zero-day vulnerabilities in 12 projects that were all confirmed and fixed.

What carries the argument

Suspicious Point, a novel control-flow-based abstraction that enables precise vulnerability localization at the optimal granularity between overly broad function-level and overly narrow line-level analysis.

If this is right

  • All reported vulnerabilities are guaranteed to be reproducible by standard fuzzers like those in OSS-Fuzz.
  • The hierarchical analysis allows effective coverage of functions even with limited computational resources.
  • Complex vulnerabilities involving multiple functions can be reasoned about more effectively using the MCP tools and context engineering.
  • The approach scales to real open-source projects and produces results that maintainers accept and fix.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This system could potentially be extended to other languages like Python or Java by adapting the fuzzing backend.
  • Deployment in continuous integration systems might catch vulnerabilities earlier in the development cycle.
  • Combining LLM reasoning with traditional dynamic analysis tools like fuzzing points to a promising hybrid direction for security research.

Load-bearing premise

The multi-agent LLM system with its Suspicious Point abstraction and MCP tools can reliably manage complex cross-function dependencies to produce only fuzzer-reproducible vulnerability reports.

What would settle it

Testing FuzzingBrain V2 on a new collection of vulnerabilities that feature intricate cross-function dependencies and finding that it misses many or reports many non-reproducible cases would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.21779 by Jeff Huang, Kewen Zhu, Qingxiao Xu, Ze Sheng, Zhicheng Chen.

Figure 1
Figure 1. Figure 1: Annual CVE disclosures from 2020 to 2025, show [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: System overview of FuzzingBrain V2. The Controller orchestrates static analysis, agent pipeline execution, and fuzzing. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Example Suspicious Point. The description uses [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Worker distribution. The scheduler allocates each [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Per-worker pipeline. Upper: single agent implementation with LLM tiers (T1 reasoning, T2 main, T3 utils) and MCP [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example direction for libpng [16]. Each direction in￾cludes a business feature name, entry/core functions defining analysis scope, and risk level for prioritization. The Direction Generator (prompt in [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Vulnerability discovery results on the AFC (AIxCC Final Challenge) dataset. Each column represents a challenge [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: PoV requirements for two hard challenges. (a) Leap [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Ablation study results. Stars indicate vulnerabilities found by ablation configurations but missed by FuzzingBrain V2 [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: shows the distribution of vulnerability types. NULL pointer dereferences (7) and heap buffer overflows (6) are the most common, followed by memory leaks (5) and denial-of-service vulnerabilities (5). This distribution closely mirrors the AFC benchmark dataset, suggesting that FuzzingBrain V2’s detection capabilities generalize well to real-world scenarios [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Direction Generator Prompt (abridged; full prompt available in supplementary material) [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: SP Generator Prompt (abridged; full prompt available in supplementary material) [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: SP Verifier Prompt (abridged; full prompt available in supplementary material) [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: PoC Generator Prompt (abridged; full prompt available in supplementary material) [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗
read the original abstract

Software vulnerabilities pose critical security threats, with nearly 50,000 CVEs reported in 2025. While Large Language Models (LLMs) show promise for automated vulnerability detection, three key challenges remain. First, LLM-generated vulnerability reports suffer from high false positive rates and lack reproducible verification. Second, existing LLM-based approaches use suboptimal granularities for vulnerability localization: function-level analysis overlooks bugs when context becomes extensive, while line-level analysis lacks sufficient context. Third, existing approaches have difficulty reasoning about vulnerabilities with complex cross-function dependencies and triggering conditions. We present FuzzingBrain V2, a multi-agent system that addresses these gaps through four key contributions: (1) fully automated vulnerability analysis built on Google's OSS-Fuzz, ensuring all reported vulnerabilities are fuzzer-reproducible; (2) Suspicious Point, a novel control-flow-based abstraction for precise vulnerability localization at the optimal granularity; (3) logic-driven hierarchical function analysis with dual-layer fuzzing enhancing function coverage under resource constraints; (4) MCP-based static and dynamic analysis tools with context engineering enhancing complex vulnerability reasoning. On the AIxCC 2025 Final Competition C/C++ dataset, FuzzingBrain V2 achieved 90% detection rate (36 of 40 vulnerabilities). In real-world deployment, FuzzingBrain V2 discovered 29 zero-day vulnerabilities across 12 open-source projects, all confirmed and fixed by maintainers, with 2 assigned CVE IDs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents FuzzingBrain V2, a multi-agent LLM system for automated vulnerability discovery and reproduction. It identifies three challenges in prior LLM-based approaches (high false-positive rates without reproducible verification, suboptimal function- vs. line-level granularity, and poor handling of cross-function dependencies) and claims to address them via four contributions: (1) full automation on Google's OSS-Fuzz for fuzzer-reproducible reports, (2) a novel 'Suspicious Point' control-flow abstraction for precise localization, (3) logic-driven hierarchical analysis with dual-layer fuzzing, and (4) MCP-based static/dynamic analysis tools. Central empirical claims are a 90% detection rate (36 of 40 vulnerabilities) on the AIxCC 2025 Final Competition C/C++ dataset and discovery of 29 confirmed zero-day vulnerabilities across 12 open-source projects (all fixed by maintainers, two with CVEs).

Significance. If the reported detection rates and zero-day findings are supported by rigorous methodology, baselines, and ablations, the work would constitute a meaningful step toward practical, reproducible LLM-driven vulnerability analysis. The integration of multi-agent reasoning with established fuzzing infrastructure and the emphasis on fuzzer-reproducibility distinguish it from purely static LLM prompting approaches and could influence both academic benchmarks and industrial security tooling.

major comments (3)
  1. [Abstract] Abstract: The 90% detection rate (36/40) on the AIxCC dataset is presented without any description of evaluation methodology, how the 40 vulnerabilities were chosen, false-positive rates, or baseline comparisons (e.g., single-agent LLM or standard OSS-Fuzz pipelines). This information is load-bearing for the central claim that the four listed contributions (OSS-Fuzz automation, Suspicious Point abstraction, hierarchical analysis, MCP tools) are responsible for the result rather than generic LLM capabilities or the underlying fuzzing infrastructure.
  2. [Real-world deployment results] Real-world results paragraph: The claim of 29 confirmed zero-days across 12 projects lacks details on selection criteria, verification process, false-positive filtering, or how cross-function dependencies were handled in practice. Without these, it is impossible to assess whether the multi-agent setup with Suspicious Point and MCP tools reliably produces only fuzzer-reproducible reports, as asserted in the abstract.
  3. [Contributions and Evaluation] Contributions and evaluation sections: No ablation studies or controlled comparisons isolate the incremental benefit of the Suspicious Point control-flow abstraction or the dual-layer fuzzing hierarchy versus simpler prompting or single-agent variants on the same inputs. This omission directly undermines attribution of the 90% rate and 29 zero-days to the novel components rather than OSS-Fuzz or base LLM reasoning.
minor comments (2)
  1. [Abstract] The acronym 'MCP' is introduced without expansion on first use in the abstract and contributions list; a parenthetical definition or reference would improve readability.
  2. [Suspicious Point description] Clarify the precise definition and construction of the 'Suspicious Point' abstraction (including any control-flow graph traversal rules) in the section describing the localization method, as it is presented as a core novel contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for providing a detailed and insightful review of our manuscript. Below, we respond to each major comment in turn, offering clarifications and committing to revisions that address the concerns raised.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The 90% detection rate (36/40) on the AIxCC dataset is presented without any description of evaluation methodology, how the 40 vulnerabilities were chosen, false-positive rates, or baseline comparisons (e.g., single-agent LLM or standard OSS-Fuzz pipelines). This information is load-bearing for the central claim that the four listed contributions (OSS-Fuzz automation, Suspicious Point abstraction, hierarchical analysis, MCP tools) are responsible for the result rather than generic LLM capabilities or the underlying fuzzing infrastructure.

    Authors: We appreciate this observation regarding the abstract's conciseness. The abstract is subject to strict length limits, but the full manuscript provides the requested details in Sections 4 and 5: the 40 vulnerabilities constitute the complete C/C++ set from the AIxCC 2025 Final Competition, chosen by the organizers; the 90% detection rate measures cases for which FuzzingBrain V2 generated a fuzzer-reproducible report that triggered the vulnerability under OSS-Fuzz; false positives are controlled by requiring OSS-Fuzz reproduction; and baseline comparisons against single-agent LLM prompting and standard OSS-Fuzz pipelines appear in Table 2. We will revise the abstract to include a brief clause on the dataset source and the reproducibility criterion to better contextualize the claim. revision: partial

  2. Referee: [Real-world deployment results] Real-world results paragraph: The claim of 29 confirmed zero-days across 12 projects lacks details on selection criteria, verification process, false-positive filtering, or how cross-function dependencies were handled in practice. Without these, it is impossible to assess whether the multi-agent setup with Suspicious Point and MCP tools reliably produces only fuzzer-reproducible reports, as asserted in the abstract.

    Authors: We agree that expanded details are warranted for transparency. In the revised manuscript we will augment the real-world results section with: selection criteria (open-source projects already integrated with OSS-Fuzz and exhibiting recent development activity); verification process (each report was reproduced via OSS-Fuzz and independently confirmed by project maintainers); false-positive filtering (achieved through the dual-layer fuzzing and Suspicious Point localization that discard non-reproducible candidates); and handling of cross-function dependencies (via the logic-driven hierarchical analysis and MCP-based context engineering). These additions will directly support the reproducibility assertions. revision: yes

  3. Referee: [Contributions and Evaluation] Contributions and evaluation sections: No ablation studies or controlled comparisons isolate the incremental benefit of the Suspicious Point control-flow abstraction or the dual-layer fuzzing hierarchy versus simpler prompting or single-agent variants on the same inputs. This omission directly undermines attribution of the 90% rate and 29 zero-days to the novel components rather than OSS-Fuzz or base LLM reasoning.

    Authors: This is a fair critique of attribution strength. While the evaluation section already includes comparisons to prior LLM-based approaches, explicit ablations isolating the Suspicious Point abstraction and dual-layer hierarchy were not performed. We will add a new subsection containing controlled ablation experiments on a representative subset of the AIxCC benchmark, measuring performance deltas when each component is removed or replaced with simpler prompting. This will more rigorously attribute gains to the novel contributions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on datasets and deployments

full rationale

The paper presents an engineering system (multi-agent LLM with OSS-Fuzz integration, Suspicious Point abstraction, hierarchical analysis, and MCP tools) and supports its claims solely through reported empirical outcomes: 90% detection on the AIxCC 2025 C/C++ dataset (36/40) and 29 confirmed zero-days in real projects. No derivation chain, equations, fitted parameters, or first-principles predictions appear in the abstract or described contributions. Results are externally validated by competition scoring and maintainer fixes/CVEs rather than internal consistency or self-referential definitions. Self-citations, if present, are not load-bearing for the central performance claims. This is a standard empirical systems paper whose evaluation stands independent of any tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The work rests on domain assumptions about LLM reasoning capabilities for code and the effectiveness of fuzzing for verification; it introduces one new abstraction without independent evidence outside the system itself.

axioms (1)
  • domain assumption Large language models can perform effective vulnerability reasoning when supplied with suitable context engineering and analysis tools.
    The multi-agent architecture and MCP-based static/dynamic analysis depend on this capability.
invented entities (1)
  • Suspicious Point no independent evidence
    purpose: Control-flow-based abstraction for precise vulnerability localization at optimal granularity between function and line level.
    Presented as a novel abstraction to address granularity limitations in existing LLM-based approaches.

pith-pipeline@v0.9.0 · 5810 in / 1362 out tokens · 37452 ms · 2026-05-22T08:39:10.001009+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

  1. [1]

    Model context protocol

    Anthropic. Model context protocol. Anthropic, 2024. https://modelcontextprotocol.io/

  2. [2]

    QL: Object-oriented queries on relational data

    Pavel Avgustinov, Oege de Moor, Michael Peyton Jones, and Max Sheridan. QL: Object-oriented queries on relational data. InEuropean Conference on Object- Oriented Programming, pages 2–27. Springer, 2016

  3. [3]

    O’Hearn, and Hongseok Yang

    Cristiano Calcagno, Dino Distefano, Peter W. O’Hearn, and Hongseok Yang. Compositional shape analysis by means of bi-abduction. InProceedings of the 36th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 289–300. ACM, 2009

  4. [4]

    Memory safe languages: Reducing vulnerabilities in modern software development

    CISA and NSA. Memory safe languages: Reducing vulnerabilities in modern software development. Cyber- security Information Sheet, 2025.https://media.de fense.gov/2025/Jun/23/2003742198/-1/-1/0/C SI_MEMORY_SAFE_LANGUAGES.PDF

  5. [5]

    Llm-assisted static analysis for detecting security vulnerabilities,

    Roland Croft, Yusuf Newaz, Ziqi Chen, and Muham- mad Ali Babar. IRIS: LLM-assisted static analysis for detecting security vulnerabilities. InarXiv preprint arXiv:2405.17238, 2024

  6. [6]

    AIxCC: Artificial intelligence cyber challenge

    DARPA. AIxCC: Artificial intelligence cyber challenge. DARPA Program, 2025. https://aicyberchallen ge.com/

  7. [7]

    LLM-assisted fuzz driver generation for OSS-Fuzz

    Yinlin Deng, Chunqiu Steven Yang, Chaoyu Wei, Jiayi Yao, Jiawei Liu, and Lingming Zhang. LLM-assisted fuzz driver generation for OSS-Fuzz. InarXiv preprint arXiv:2312.02632, 2024

  8. [8]

    Vulnerability detec- tion with code language models: How far are we? In Proceedings of the 46th IEEE/ACM International Con- ference on Software Engineering (ICSE)

    Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomari, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detec- tion with code language models: How far are we? In Proceedings of the 46th IEEE/ACM International Con- ference on Software Engineering (ICSE). ACM, 2024

  9. [9]

    CodeBERT: A pre- trained model for programming and natural languages

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xi- aocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. CodeBERT: A pre- trained model for programming and natural languages. InFindings of the Association for Computational Lin- guistics: EMNLP 2020, pages 1536–1547. ACL, 2020

  10. [10]

    Generative AI in cybersecurity: A comprehensive review of LLM ap- plications and vulnerabilities.Internet of Things and Cyber-Physical Systems, 5:100082, 2025

    Mohamed Amine Ferrag, Fatima Alwahedi, Ammar Bat- tah, Bilel Cherif, Abdechakour Mechri, Norbert Tihanyi, Tamas Bisztray, and Merouane Debbah. Generative AI in cybersecurity: A comprehensive review of LLM ap- plications and vulnerabilities.Internet of Things and Cyber-Physical Systems, 5:100082, 2025

  11. [11]

    VulBERTa: Simpli- fied source code pre-training for vulnerability detection

    Hazim Hanif and Sergio Maffeis. VulBERTa: Simpli- fied source code pre-training for vulnerability detection. In2022 International Joint Conference on Neural Net- works (IJCNN), pages 1–8. IEEE, 2022

  12. [12]

    Larry Huynh, Yinghao Zhang, Djimon Jayasundera, Woojin Jeon, Hyoungshick Kim, Tingting Bi, and Jin B. Hong. Detecting code vulnerabilities using LLMs. In 2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 2025

  13. [13]

    LLMxCPG: Context-aware vulner- ability detection through code property graph-guided large language models

    Ahmed Lekssays, Hamza Mouhcine, Khang Tran, Ting Yu, and Issa Khalil. LLMxCPG: Context-aware vulner- ability detection through code property graph-guided large language models. In34th USENIX Security Sym- posium. USENIX Association, 2025

  14. [14]

    Every- thing you wanted to know about LLM-based vulnera- bility detection but were afraid to ask.arXiv preprint arXiv:2504.13474, 2025

    Yue Li, Xiao Li, Hao Wu, Minghui Xu, Yue Zhang, Xi- uzhen Cheng, Fengyuan Xu, and Sheng Zhong. Every- thing you wanted to know about LLM-based vulnera- bility detection but were afraid to ask.arXiv preprint arXiv:2504.13474, 2025

  15. [15]

    VulDeeP- ecker: A deep learning-based system for vulnerability detection

    Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. VulDeeP- ecker: A deep learning-based system for vulnerability detection. InProceedings of the 2018 Network and Dis- tributed System Security Symposium (NDSS). Internet Society, 2018

  16. [16]

    libpng: The official PNG reference library

    libpng Development Team. libpng: The official PNG reference library. libpng.org, 2024. http://www.libp ng.org/pub/png/libpng.html

  17. [17]

    libFuzzer: A library for coverage-guided fuzz testing, 2015

    LLVM Project. libFuzzer: A library for coverage-guided fuzz testing, 2015. https://llvm.org/docs/LibFuz zer.html

  18. [18]

    FastMCP: The fast, pythonic way to build MCP servers and clients

    Jeremiah Lowin. FastMCP: The fast, pythonic way to build MCP servers and clients. GitHub, 2024. https: //github.com/jlowin/fastmcp

  19. [19]

    A proactive approach to more secure code

    Matt Miller. A proactive approach to more secure code. Microsoft Security Blog, 2019. https://msrc.micro soft.com/blog/2019/07/a-proactive-approac h-to-more-secure-code/

  20. [20]

    GRACE: Empow- ering LLM-based software vulnerability detection with graph structure and in-context learning

    Van Nguyen, Trung Le, Navid Ahmadi, Lizhen Huynh, Dinh Phung, and Anwarul Haque. GRACE: Empow- ering LLM-based software vulnerability detection with graph structure and in-context learning. InJournal of Systems and Software, volume 212, page 112031. Else- vier, 2024

  21. [21]

    Do users write more insecure code with AI assistants? InProceedings of the 2023 ACM SIGSAC 15 Conference on Computer and Communications Security, pages 2785–2799

    Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. Do users write more insecure code with AI assistants? InProceedings of the 2023 ACM SIGSAC 15 Conference on Computer and Communications Security, pages 2785–2799. ACM, 2023

  22. [22]

    OSS-Fuzz: Google’s continuous fuzzing service for open source software

    Kostya Serebryany. OSS-Fuzz: Google’s continuous fuzzing service for open source software. InUSENIX Security Symposium, 2017

  23. [23]

    LLMs in software secu- rity: A survey of vulnerability detection techniques and insights.ACM Computing Surveys, 58(5), 2025

    Ze Sheng, Zhicheng Chen, Shuning Gu, Heqing Huang, Guofei Gu, and Jeff Huang. LLMs in software secu- rity: A survey of vulnerability detection techniques and insights.ACM Computing Surveys, 58(5), 2025

  24. [24]

    Donaldson, Guofei Gu, and Jeff Huang

    Ze Sheng, Qingxiao Xu, Jianwei Huang, Matthew Wood- cock, Heqing Huang, Alastair F. Donaldson, Guofei Gu, and Jeff Huang. All you need is a fuzzing brain: An llm- powered system for automated vulnerability detection and patching, 2025

  25. [25]

    LLM4Vuln: A unified evaluation framework for decoupling and enhancing LLMs’ vulnerability reasoning

    Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Wei Ma, Lyuye Zhang, Yang Shi, and Yang Liu. LLM4Vuln: A unified evaluation framework for decoupling and enhancing LLMs’ vulnerability reasoning. InarXiv preprint arXiv:2401.16185, 2024

  26. [26]

    Memory safety

    The Chromium Projects. Memory safety. Chromium Security Documentation, 2020. https://www.chromi um.org/Home/chromium-security/memory-safet y/

  27. [27]

    Advanced smart contract vulnerability detection via LLM-powered multi-agent systems.IEEE Trans- actions on Software Engineering, 51(10):2830–2846, 2025

    Zhiyuan Wei, Jing Sun, Yuqiang Sun, Ye Liu, Daoyuan Wu, Zijian Zhang, Xianhao Zhang, Meng Li, Yang Liu, Chunmiao Li, Mingchao Wan, Jin Dong, and Liehuang Zhu. Advanced smart contract vulnerability detection via LLM-powered multi-agent systems.IEEE Trans- actions on Software Engineering, 51(10):2830–2846, 2025

  28. [28]

    MongoBleed: Critical MongoDB vul- nerability CVE-2025-14847

    Wiz Research. MongoBleed: Critical MongoDB vul- nerability CVE-2025-14847. Wiz Blog, 2025. https: //www.wiz.io/blog/mongobleed-cve-2025-148 47-exploited-in-the-wild-mongodb

  29. [29]

    Fuzz4All: Uni- versal fuzzing with large language models

    Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Fuzz4All: Uni- versal fuzzing with large language models. InProceed- ings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE). ACM, 2024

  30. [30]

    Modeling and discovering vulnerabilities with code property graphs

    Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. Modeling and discovering vulnerabilities with code property graphs. In2014 IEEE Symposium on Security and Privacy, pages 590–604. IEEE, 2014

  31. [31]

    WhiteFox: White-box compiler fuzzing empowered by large language models

    Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Ji- awei Liu, Reyhaneh Jabbarvand, and Lingming Zhang. WhiteFox: White-box compiler fuzzing empowered by large language models. InProceedings of the ACM on Programming Languages, volume 8, pages 1–27. ACM, 2024

  32. [32]

    Ker- nelGPT: Enhanced kernel fuzzing via large language models

    Chenyuan Yang, Zijie Zou, and Lingming Zhang. Ker- nelGPT: Enhanced kernel fuzzing via large language models. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Anal- ysis (ISSTA). ACM, 2024

  33. [33]

    American Fuzzy Lop: A security- oriented fuzzer

    Michał Zalewski. American Fuzzy Lop: A security- oriented fuzzer. Google, 2014. https://lcamtuf.co redump.cx/afl/

  34. [34]

    Vul-RAG: Enhancing LLM-based vulnerability detection via knowledge-level RAG

    Xueying Zhang, Jiongyi Zhang, Zhengyang Su, Yan- lin Chen, Shiyu Xu, Zhi Lin, Lianxiao Tan, Yichi Guo, Yuqun Gu, and Shuiguang Deng. Vul-RAG: Enhancing LLM-based vulnerability detection via knowledge-level RAG. InarXiv preprint arXiv:2406.11147, 2024

  35. [35]

    Large language model for vulnerability detection and repair: Literature review and the road ahead.ACM Transactions on Software Engineering and Methodol- ogy, 34:1–31, 2024

    Xin Zhou, Sicong Cao, Xiaobing Sun, and David Lo. Large language model for vulnerability detection and repair: Literature review and the road ahead.ACM Transactions on Software Engineering and Methodol- ogy, 34:1–31, 2024

  36. [36]

    directions

    Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. Devign: Effective vulnerability identifi- cation by learning comprehensive program semantics via graph neural networks. InAdvances in Neural Infor- mation Processing Systems, volume 32, 2019. A Agent Prompts 16 Direction Generator Prompt (Abridged) You are a security architect analyzing a...