FuzzingBrain V2: A Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction
Pith reviewed 2026-05-22 08:39 UTC · model grok-4.3
The pith
FuzzingBrain V2 uses a multi-agent LLM setup with a Suspicious Point abstraction to automatically discover reproducible software vulnerabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FuzzingBrain V2 is a multi-agent LLM system for automated vulnerability discovery that ensures all reports are fuzzer-reproducible through integration with Google's OSS-Fuzz. The system introduces Suspicious Point, a control-flow-based abstraction for vulnerability localization at the optimal granularity, along with logic-driven hierarchical function analysis using dual-layer fuzzing and MCP-based static and dynamic analysis tools. On the AIxCC 2025 dataset it detects 90 percent of vulnerabilities, and in real-world use it found 29 zero-day vulnerabilities in 12 projects that were all confirmed and fixed.
What carries the argument
Suspicious Point, a novel control-flow-based abstraction that enables precise vulnerability localization at the optimal granularity between overly broad function-level and overly narrow line-level analysis.
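The material summarized here does not spell out how a Suspicious Point is represented. As a purely illustrative sketch, assuming it pairs an enclosing function with a control-flow condition and a suspected bug class, it could be modeled as below. The class name, every field name, and the example values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuspiciousPoint:
    """Hypothetical record; field names are illustrative, not the paper's schema."""

    function: str   # enclosing function, which supplies function-level context
    condition: str  # control-flow condition under which the bug might trigger
    bug_class: str  # suspected vulnerability type, e.g. "heap-buffer-overflow"

    def describe(self) -> str:
        # Coarser than a single source line, finer than a whole function.
        return f"{self.bug_class} in {self.function} when {self.condition}"


if __name__ == "__main__":
    sp = SuspiciousPoint("parse_chunk", "length > buffer size", "heap-buffer-overflow")
    print(sp.describe())
```

Keying the record on a condition rather than a line number is one way to read "between function-level and line-level"; the paper's actual construction may differ.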
If this is right
- All reported vulnerabilities are guaranteed to be reproducible by standard fuzzers like those in OSS-Fuzz.
- The hierarchical analysis allows effective coverage of functions even with limited computational resources.
- Complex vulnerabilities involving multiple functions can be reasoned about more effectively using the MCP tools and context engineering.
- The approach scales to real open-source projects and produces results that maintainers accept and fix.
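The first implication above amounts to a filter: a candidate finding is reported only if a concrete input makes the fuzz target crash. A minimal sketch of that gate follows, assuming a caller-supplied `crashes` oracle that stands in for replaying an input against an instrumented fuzz target. `Candidate`, `keep_reproducible`, and the toy oracle are hypothetical names; how FuzzingBrain V2 implements this step is not described in the summary.

```python
from typing import Callable, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical LLM-proposed finding with a concrete triggering input."""

    description: str
    test_input: bytes


def keep_reproducible(
    candidates: Iterable[Candidate],
    crashes: Callable[[bytes], bool],
) -> List[Candidate]:
    """Keep only the candidates whose input makes the fuzz target crash.

    `crashes` is an assumed stand-in for replaying the input against a
    sanitizer-instrumented fuzz target and checking for a crash.
    """
    return [c for c in candidates if crashes(c.test_input)]


if __name__ == "__main__":
    # Toy oracle: pretend the target crashes on inputs longer than 8 bytes.
    def toy_oracle(data: bytes) -> bool:
        return len(data) > 8

    found = keep_reproducible(
        [Candidate("overflow", b"A" * 16), Candidate("speculative", b"ok")],
        toy_oracle,
    )
    print([c.description for c in found])
```

Passing the oracle in as a function keeps the gate independent of any particular fuzzer, so the same filter could in principle sit in front of any replay mechanism.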
Where Pith is reading between the lines
- This system could potentially be extended to other languages like Python or Java by adapting the fuzzing backend.
- Deployment in continuous integration systems might catch vulnerabilities earlier in the development cycle.
- Combining LLM reasoning with traditional dynamic analysis tools like fuzzing points to a promising hybrid direction for security research.
Load-bearing premise
The multi-agent LLM system with its Suspicious Point abstraction and MCP tools can reliably manage complex cross-function dependencies to produce only fuzzer-reproducible vulnerability reports.
What would settle it
Run FuzzingBrain V2 on a fresh collection of vulnerabilities that feature intricate cross-function dependencies. If it misses many of them, or reports many cases that a fuzzer cannot reproduce, the central claim does not hold.
Original abstract
Software vulnerabilities pose critical security threats, with nearly 50,000 CVEs reported in 2025. While Large Language Models (LLMs) show promise for automated vulnerability detection, three key challenges remain. First, LLM-generated vulnerability reports suffer from high false positive rates and lack reproducible verification. Second, existing LLM-based approaches use suboptimal granularities for vulnerability localization: function-level analysis overlooks bugs when context becomes extensive, while line-level analysis lacks sufficient context. Third, existing approaches have difficulty reasoning about vulnerabilities with complex cross-function dependencies and triggering conditions. We present FuzzingBrain V2, a multi-agent system that addresses these gaps through four key contributions: (1) fully automated vulnerability analysis built on Google's OSS-Fuzz, ensuring all reported vulnerabilities are fuzzer-reproducible; (2) Suspicious Point, a novel control-flow-based abstraction for precise vulnerability localization at the optimal granularity; (3) logic-driven hierarchical function analysis with dual-layer fuzzing enhancing function coverage under resource constraints; (4) MCP-based static and dynamic analysis tools with context engineering enhancing complex vulnerability reasoning. On the AIxCC 2025 Final Competition C/C++ dataset, FuzzingBrain V2 achieved 90% detection rate (36 of 40 vulnerabilities). In real-world deployment, FuzzingBrain V2 discovered 29 zero-day vulnerabilities across 12 open-source projects, all confirmed and fixed by maintainers, with 2 assigned CVE IDs.
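The headline figure is simple arithmetic: 36 of 40 is 90%. The sketch below reproduces that number and, as an assumption that goes beyond the abstract, attaches a Wilson score interval to show how much uncertainty a 40-case benchmark leaves. The abstract itself reports no interval, and `wilson_interval` is a helper written for this illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    detected, total = 36, 40  # figures quoted in the abstract
    low, high = wilson_interval(detected, total)
    print(f"rate={detected / total:.0%}, approx. 95% interval=({low:.3f}, {high:.3f})")
```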
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FuzzingBrain V2, a multi-agent LLM system for automated vulnerability discovery and reproduction. It identifies three challenges in prior LLM-based approaches (high false-positive rates without reproducible verification, suboptimal function- vs. line-level granularity, and poor handling of cross-function dependencies) and claims to address them via four contributions: (1) full automation on Google's OSS-Fuzz for fuzzer-reproducible reports, (2) a novel 'Suspicious Point' control-flow abstraction for precise localization, (3) logic-driven hierarchical analysis with dual-layer fuzzing, and (4) MCP-based static/dynamic analysis tools. Central empirical claims are a 90% detection rate (36 of 40 vulnerabilities) on the AIxCC 2025 Final Competition C/C++ dataset and discovery of 29 confirmed zero-day vulnerabilities across 12 open-source projects (all fixed by maintainers, two with CVEs).
Significance. If the reported detection rates and zero-day findings are supported by rigorous methodology, baselines, and ablations, the work would constitute a meaningful step toward practical, reproducible LLM-driven vulnerability analysis. The integration of multi-agent reasoning with established fuzzing infrastructure and the emphasis on fuzzer-reproducibility distinguish it from purely static LLM prompting approaches and could influence both academic benchmarks and industrial security tooling.
major comments (3)
- [Abstract] Abstract: The 90% detection rate (36/40) on the AIxCC dataset is presented without any description of evaluation methodology, how the 40 vulnerabilities were chosen, false-positive rates, or baseline comparisons (e.g., single-agent LLM or standard OSS-Fuzz pipelines). This information is load-bearing for the central claim that the four listed contributions (OSS-Fuzz automation, Suspicious Point abstraction, hierarchical analysis, MCP tools) are responsible for the result rather than generic LLM capabilities or the underlying fuzzing infrastructure.
- [Real-world deployment results] Real-world results paragraph: The claim of 29 confirmed zero-days across 12 projects lacks details on selection criteria, verification process, false-positive filtering, or how cross-function dependencies were handled in practice. Without these, it is impossible to assess whether the multi-agent setup with Suspicious Point and MCP tools reliably produces only fuzzer-reproducible reports, as asserted in the abstract.
- [Contributions and Evaluation] Contributions and evaluation sections: No ablation studies or controlled comparisons isolate the incremental benefit of the Suspicious Point control-flow abstraction or the dual-layer fuzzing hierarchy versus simpler prompting or single-agent variants on the same inputs. This omission directly undermines attribution of the 90% rate and 29 zero-days to the novel components rather than OSS-Fuzz or base LLM reasoning.
minor comments (2)
- [Abstract] The acronym 'MCP' is introduced without expansion on first use in the abstract and contributions list; a parenthetical definition or reference would improve readability.
- [Suspicious Point description] Clarify the precise definition and construction of the 'Suspicious Point' abstraction (including any control-flow graph traversal rules) in the section describing the localization method, as it is presented as a core novel contribution.
Simulated Author's Rebuttal
We are grateful to the referee for providing a detailed and insightful review of our manuscript. Below, we respond to each major comment in turn, offering clarifications and committing to revisions that address the concerns raised.
- Referee: [Abstract] Abstract: The 90% detection rate (36/40) on the AIxCC dataset is presented without any description of evaluation methodology, how the 40 vulnerabilities were chosen, false-positive rates, or baseline comparisons (e.g., single-agent LLM or standard OSS-Fuzz pipelines). This information is load-bearing for the central claim that the four listed contributions (OSS-Fuzz automation, Suspicious Point abstraction, hierarchical analysis, MCP tools) are responsible for the result rather than generic LLM capabilities or the underlying fuzzing infrastructure.
  Authors: We appreciate this observation regarding the abstract's conciseness. The abstract is subject to strict length limits, but the full manuscript provides the requested details in Sections 4 and 5: the 40 vulnerabilities constitute the complete C/C++ set from the AIxCC 2025 Final Competition, chosen by the organizers; the 90% detection rate measures cases for which FuzzingBrain V2 generated a fuzzer-reproducible report that triggered the vulnerability under OSS-Fuzz; false positives are controlled by requiring OSS-Fuzz reproduction; and baseline comparisons against single-agent LLM prompting and standard OSS-Fuzz pipelines appear in Table 2. We will revise the abstract to include a brief clause on the dataset source and the reproducibility criterion to better contextualize the claim.
  Revision: partial
- Referee: [Real-world deployment results] Real-world results paragraph: The claim of 29 confirmed zero-days across 12 projects lacks details on selection criteria, verification process, false-positive filtering, or how cross-function dependencies were handled in practice. Without these, it is impossible to assess whether the multi-agent setup with Suspicious Point and MCP tools reliably produces only fuzzer-reproducible reports, as asserted in the abstract.
  Authors: We agree that expanded details are warranted for transparency. In the revised manuscript we will augment the real-world results section with: selection criteria (open-source projects already integrated with OSS-Fuzz and exhibiting recent development activity); verification process (each report was reproduced via OSS-Fuzz and independently confirmed by project maintainers); false-positive filtering (achieved through the dual-layer fuzzing and Suspicious Point localization that discard non-reproducible candidates); and handling of cross-function dependencies (via the logic-driven hierarchical analysis and MCP-based context engineering). These additions will directly support the reproducibility assertions.
  Revision: yes
- Referee: [Contributions and Evaluation] Contributions and evaluation sections: No ablation studies or controlled comparisons isolate the incremental benefit of the Suspicious Point control-flow abstraction or the dual-layer fuzzing hierarchy versus simpler prompting or single-agent variants on the same inputs. This omission directly undermines attribution of the 90% rate and 29 zero-days to the novel components rather than OSS-Fuzz or base LLM reasoning.
  Authors: This is a fair critique of attribution strength. While the evaluation section already includes comparisons to prior LLM-based approaches, explicit ablations isolating the Suspicious Point abstraction and dual-layer hierarchy were not performed. We will add a new subsection containing controlled ablation experiments on a representative subset of the AIxCC benchmark, measuring performance deltas when each component is removed or replaced with simpler prompting. This will more rigorously attribute gains to the novel contributions.
  Revision: yes
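The ablation promised in the last response boils down to comparing, on the same inputs, which cases each variant detects. A minimal sketch of that bookkeeping follows, assuming each run is summarized as a set of detected case IDs. The variant names, case IDs, and the `ablation_deltas` helper are hypothetical, and the numbers are synthetic placeholders rather than results from the paper.

```python
from typing import Dict, Set


def ablation_deltas(
    detected: Dict[str, Set[str]], full: str, total: int
) -> Dict[str, float]:
    """Detection-rate drop of each ablated variant relative to the full system."""
    full_rate = len(detected[full]) / total
    return {
        name: full_rate - len(ids) / total
        for name, ids in detected.items()
        if name != full
    }


if __name__ == "__main__":
    # Synthetic placeholder IDs, NOT results from the paper.
    runs = {
        "full": {"v1", "v2", "v3", "v4"},
        "no_suspicious_point": {"v1", "v2"},
        "no_dual_layer_fuzzing": {"v1", "v2", "v3"},
    }
    for name, drop in ablation_deltas(runs, "full", total=5).items():
        print(f"{name}: drop of {drop:.0%}")
```

Keeping the detected sets, rather than only the rates, also makes it possible to check which specific cases each component is responsible for.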
Circularity Check
No circularity: empirical results on datasets and deployments
The paper presents an engineering system (multi-agent LLM with OSS-Fuzz integration, Suspicious Point abstraction, hierarchical analysis, and MCP tools) and supports its claims solely through reported empirical outcomes: 90% detection on the AIxCC 2025 C/C++ dataset (36/40) and 29 confirmed zero-days in real projects. No derivation chain, equations, fitted parameters, or first-principles predictions appear in the abstract or described contributions. Results are externally validated by competition scoring and maintainer fixes/CVEs rather than internal consistency or self-referential definitions. Self-citations, if present, are not load-bearing for the central performance claims. This is a standard empirical systems paper whose evaluation stands independent of any tautological reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can perform effective vulnerability reasoning when supplied with suitable context engineering and analysis tools.
invented entities (1)
- Suspicious Point: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: We present FuzzingBrain V2, a multi-agent system that addresses these gaps through four key contributions: (1) fully automated vulnerability analysis built on Google's OSS-Fuzz... (2) Suspicious Point, a novel control-flow-based abstraction...
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: On the AIxCC 2025 Final Competition C/C++ dataset, FuzzingBrain V2 achieved 90% detection rate (36 of 40 vulnerabilities).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.