pith. machine review for the scientific record.

arxiv: 2604.06618 · v2 · submitted 2026-04-08 · 💻 cs.CR

Recognition: 2 theorem links


PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3

classification 💻 cs.CR
keywords PoC generation · semantic oracle · vulnerability reproduction · LLM agents · adaptive policy · exploit verification · CVE · security automation

The pith

PoC-Adapt uses semantic state comparison and adaptive policies to boost PoC exploit verification reliability by 25%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PoC-Adapt is designed to automate the generation and reliable verification of proof-of-concept exploits for reported vulnerabilities using large language models in a multi-agent setup. The key innovation is a Semantic Oracle that compares structured system states before and after exploit execution to confirm that the vulnerability was actually triggered. This overcomes the unreliability of surface-level checks in previous systems. Additionally, an Adaptive Policy Learning component learns effective exploitation strategies to reduce the expense of trial-and-error. A sympathetic reader would care because this could make it more practical for security professionals to reproduce and confirm vulnerabilities automatically.

Core claim

At its core, PoC-Adapt employs a Semantic Oracle for validation through pre- and post-execution state comparisons and an Adaptive Policy Learning mechanism based on reinforcement learning to guide exploit generation. Implemented with specialized agents for root cause analysis, environment building, exploit generation, and semantic validation, the system demonstrates a 25% improvement in verification reliability on benchmarks and successfully generates verified PoCs for recent CVEs at an average cost of $0.42 per generated exploit.

What carries the argument

The Semantic Oracle validates exploits by comparing structured pre- and post-execution system states to distinguish true exploitation from incidental changes, supported by Adaptive Policy Learning that learns an exploitation policy over semantic states and actions to minimize failed attempts.
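The mechanism as described can be sketched as a structured state diff. This is an editorial sketch, not the paper's implementation: the snapshot fields (`open_files`, `spawned_procs`, `outbound_conns`) and the expected-effect signature are invented for illustration.

```python
# Hedged sketch of a Semantic-Oracle-style check: an exploit counts as
# verified only if the pre/post state diff matches the vulnerability's
# expected effect, not merely if *any* state changed.
def state_diff(pre: dict, post: dict) -> dict:
    """Return the fields whose values changed between snapshots."""
    return {k: (pre.get(k), post.get(k))
            for k in pre.keys() | post.keys()
            if pre.get(k) != post.get(k)}

def oracle_verdict(pre: dict, post: dict, expected_effect: dict) -> bool:
    """True only if every expected (field -> new value) appears in the diff."""
    diff = state_diff(pre, post)
    return all(k in diff and diff[k][1] == v for k, v in expected_effect.items())

# Invented snapshots: a shell spawn is the expected effect; a new temp file
# is an incidental behavioral change that a surface-level check might accept.
pre = {"open_files": ("app.log",), "spawned_procs": (), "outbound_conns": ()}
post_exploited = {"open_files": ("app.log",), "spawned_procs": ("/bin/sh",), "outbound_conns": ()}
post_incidental = {"open_files": ("app.log", "cache.tmp"), "spawned_procs": (), "outbound_conns": ()}
expected = {"spawned_procs": ("/bin/sh",)}

assert oracle_verdict(pre, post_exploited, expected) is True   # true exploitation
assert oracle_verdict(pre, post_incidental, expected) is False # incidental change
```

The point of the sketch is the asymmetry: any-change oracles accept both snapshots, while an expected-effect diff rejects the incidental one.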

If this is right

  • Verification reliability improves by 25% over prior LLM-based PoC systems on benchmarks.
  • Exploit generation cost decreases through the use of learned adaptive policies.
  • The framework can process recent CVE reports and confirm a subset of exploits at low per-unit cost.
  • Multi-agent coordination with feedback loops enables end-to-end automation of vulnerability reproduction.
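The headline CVE-corpus numbers imply a simple budget calculation. Assuming the abstract's $0.42 figure applies uniformly to each of the 80 attempts (the paper does not say so explicitly), the per-verified-PoC cost follows directly:

```python
# Back-of-envelope on the reported CVE-corpus results; the uniform-cost
# assumption is editorial, not the paper's.
attempts = 80
verified = 12
cost_per_attempt = 0.42

confirmation_rate = verified / attempts          # 0.15
total_cost = attempts * cost_per_attempt         # $33.60
cost_per_verified_poc = total_cost / verified    # $2.80

print(f"{confirmation_rate:.0%} confirmed, ${cost_per_verified_poc:.2f} per verified PoC")
```

Even under this reading, a roughly $2.80 cost per confirmed exploit would be low per-unit cost in the sense claimed above.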

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending semantic validation to other automated security tasks could improve accuracy in areas like malware analysis or intrusion detection.
  • The approach might allow for more efficient use of LLM resources in large-scale vulnerability scanning efforts.
  • If the policy learning generalizes well, it could adapt to new vulnerability types without extensive retraining.

Load-bearing premise

Comparing structured pre- and post-execution system states reliably distinguishes true vulnerability exploitation from incidental behavioral changes in diverse programs and environments.

What would settle it

Observing an instance where PoC-Adapt reports a successful exploit validation via state changes but independent analysis confirms the reported vulnerability was not triggered by the PoC.

Figures

Figures reproduced from arXiv: 2604.06618 by Khoa Ngo-Khanh, Nguyen Huu Quyen, Phan The Duy, Van-Hau Pham.

Figure 1. Overall architecture of PoC-Adapt and data processing flow.

Figure 2. Self-verification mechanism. Self-verification, inspired by Reflexion [23], enables each agent (particularly the Exploiter) to internally critique and iteratively refine its own outputs, such as PoC candidates and associated hypotheses, against the constraints and taint paths identified in the RCA report, reducing exploration cost without any human-in-the-loop intervention.

Figure 3. Deployment of the proposed Policy Learning Mechanism within the generation pipeline. At each step, rather than letting the LLM think unconstrained, the RL layer evaluates the current state and proposes the optimal macro-action/tool. The LLM is then restricted to executing that specific tool with appropriate parameters, effectively bounding the search space and ensuring discipline during code generation.

Figure 4. Policy model training pipeline. Data is split at the CVE level (59 episodes) to prevent leakage; the CWE distribution for the RL data is shown in the figure.

Figure 5. Detailed stage-wise analysis of PoC-Adapt on GHSA-Real80.

Figure 6. Highest success rate (SR) for CWE-78 (Command Injection, 23.1%) and CWE-79 (XSS, 22.2%), with CWE-22 (Path Traversal) at 16.7%. CWE-502 (Deserialization) yields 0%, indicating limitations in handling context-dependent vulnerabilities. This suggests PoC-Adapt excels on direct-impact web vulnerabilities but struggles with subtle, logic-based ones.
Original abstract

While recent approaches leverage large language models (LLMs) and multi-agent pipelines to automatically generate proof-of-concept (PoC) exploits from vulnerability reports, existing systems often suffer from two fundamental limitations: unreliable validation based on surface-level execution signals and high operational cost caused by extensive trial-and-error during exploit generation. In this paper, we present PoC-Adapt, an end-to-end framework for automated PoC generation and verification, architected upon a foundation of semantic runtime validation and adaptive policy learning. At the core of PoC-Adapt is a Semantic Oracle that validates exploits by comparing structured pre- and post-execution system states, enabling reliable distinction between true vulnerability exploitation and incidental behavioral changes. To reduce exploration cost, we further introduce an Adaptive Policy Learning mechanism that learns an exploitation policy over semantic states and actions, guiding the exploit agent toward effective strategies with fewer failed attempts. PoC-Adapt is implemented as a multi-agent system comprising specialized agents for root cause analysis, environment building, exploit generation, and semantic validation, coordinated through structured feedback loops. Experiments on the CWE-Bench-Java and PrimeVul benchmarks show that PoC-Adapt significantly improves verification reliability by 25% and reduces exploit generation cost compared to prior LLM-based systems, highlighting the importance of semantic validation and learned action policies in automated vulnerability reproduction. Applied to the latest CVE corpus, PoC-Adapt confirmed 12 verified PoCs out of 80 reproduction attempts at a cost of $0.42 per generated exploit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 1 minor

Summary. The paper introduces PoC-Adapt, a multi-agent LLM framework for automated PoC exploit generation and verification. It centers on a Semantic Oracle that validates exploits via structured pre- and post-execution system state comparisons and an Adaptive Policy Learning mechanism (RL-driven) that learns exploitation policies to reduce trial-and-error. The system comprises agents for root cause analysis, environment building, exploit generation, and semantic validation. On CWE-Bench-Java and PrimeVul, it claims a 25% reliability improvement and lower costs versus prior LLM systems; on 80 recent CVEs it reports 12 verified PoCs at $0.42 per exploit.

Significance. If the evaluation claims hold under rigorous protocols, the work would advance automated vulnerability reproduction by replacing surface-level signals with semantic state validation and by learning cost-reducing policies. This directly targets two persistent bottlenecks in LLM-based exploit generation and could improve reproducibility in security research and red-teaming pipelines.

major comments (4)
  1. [Abstract / Evaluation] Abstract and Evaluation section: The 25% reliability improvement is asserted without defining the reliability metric, the precise baseline systems (including their configurations), statistical tests, error bars, or number of runs. This makes the central performance claim impossible to verify or reproduce.
  2. [Semantic Oracle / Methodology] Semantic Oracle description: The oracle is defined only at a high level via multi-agent feedback loops; no formal specification of captured state elements (heap objects, file descriptors, registers, etc.), comparison procedure, or handling of non-determinism is provided. No ablation isolates oracle accuracy from the adaptive policy, and no experiments on programs with logging, GC, or scheduler noise are reported.
  3. [CVE Evaluation] CVE corpus experiment: The 12/80 confirmation rate and $0.42 cost figure lack details on CVE selection criteria, definition of 'verified PoC', environment standardization, or direct comparison against the same baselines used in the benchmark experiments.
  4. [Adaptive Policy Learning] Adaptive Policy Learning: The RL component is introduced without specifying the state/action representation, reward function, learning algorithm, or convergence criteria, preventing assessment of whether the claimed cost reduction stems from policy learning or other factors.
minor comments (1)
  1. [Abstract / Introduction] The abstract and introduction repeat the same high-level claims without adding concrete implementation or evaluation details; a dedicated 'Limitations' or 'Threats to Validity' subsection would improve clarity.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify important areas where additional rigor and specification will strengthen the manuscript. We address each major comment below and will revise the paper to incorporate the requested clarifications, definitions, and experiments.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: The 25% reliability improvement is asserted without defining the reliability metric, the precise baseline systems (including their configurations), statistical tests, error bars, or number of runs. This makes the central performance claim impossible to verify or reproduce.

    Authors: We agree that the reliability metric and supporting evaluation details must be defined explicitly for reproducibility. In the revised manuscript we will define reliability as the fraction of generation attempts that produce a PoC passing semantic validation. We will name the exact baseline systems and their configurations, report results over multiple independent runs with error bars, and include appropriate statistical tests (e.g., paired t-tests). These additions will appear in the Evaluation section and be summarized in the Abstract. revision: yes
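The proposed definition (reliability as the fraction of attempts whose PoC passes semantic validation) is easy to make concrete. A minimal stdlib sketch of the metric with per-run error bars; the run outcomes below are invented for illustration, not the paper's data:

```python
import statistics

# Hedged sketch of the reliability metric the rebuttal promises to define:
# reliability = (attempts passing semantic validation) / (total attempts),
# reported as mean and sample stdev over independent runs.
def reliability(outcomes: list[bool]) -> float:
    return sum(outcomes) / len(outcomes)

runs = [
    [True, True, False, True, False],  # run 1: 3/5 pass
    [True, False, False, True, True],  # run 2: 3/5 pass
    [True, True, True, False, True],   # run 3: 4/5 pass
]
per_run = [reliability(r) for r in runs]
mean_rel = statistics.mean(per_run)
stdev_rel = statistics.stdev(per_run)
print(f"reliability = {mean_rel:.2f} +/- {stdev_rel:.2f} over {len(runs)} runs")
```

A paired test against a baseline would then compare these per-run fractions run-by-run, as the rebuttal proposes.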

  2. Referee: [Semantic Oracle / Methodology] Semantic Oracle description: The oracle is defined only at a high level via multi-agent feedback loops; no formal specification of captured state elements (heap objects, file descriptors, registers, etc.), comparison procedure, or handling of non-determinism is provided. No ablation isolates oracle accuracy from the adaptive policy, and no experiments on programs with logging, GC, or scheduler noise are reported.

    Authors: We acknowledge that the current description of the Semantic Oracle is high-level. The revision will supply a formal specification of the captured state elements (heap objects, file descriptors, registers, and other runtime artifacts), the structured comparison procedure, and the method for handling non-determinism (repeated executions with threshold-based matching). We will also add an ablation isolating the oracle from the adaptive policy and new experiments evaluating robustness under logging, garbage-collection, and scheduler noise. revision: yes
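The repeated-executions-with-threshold-matching scheme the authors promise can be sketched directly: run the PoC several times and accept only if the expected semantic change recurs in at least a threshold fraction of runs. The repetition count and 0.8 threshold here are assumed values, not the paper's.

```python
def robust_verdict(trial_verdicts: list[bool], threshold: float = 0.8) -> bool:
    """Accept only if the expected semantic change recurs in at least
    `threshold` of the repeated executions (0.8 is an assumed value)."""
    return sum(trial_verdicts) / len(trial_verdicts) >= threshold

# A flaky environmental signal (e.g. GC or scheduler noise) that appears
# in 2/5 runs is rejected; a stable exploit effect seen in 5/5 is accepted.
assert robust_verdict([True, False, True, False, False]) is False
assert robust_verdict([True] * 5) is True
```

This is the simplest form of the idea; the actual mechanism would also need to decide which state fields are stable enough to compare at all.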

  3. Referee: [CVE Evaluation] CVE corpus experiment: The 12/80 confirmation rate and $0.42 cost figure lack details on CVE selection criteria, definition of 'verified PoC', environment standardization, or direct comparison against the same baselines used in the benchmark experiments.

    Authors: We agree these experimental details are necessary. The revised manuscript will state the CVE selection criteria (recent CVEs with public reports and reproducible environments), define a verified PoC as one that passes the Semantic Oracle and produces the intended effect, describe the standardized containerized environment, and present direct head-to-head comparisons against the benchmark baselines on the CVE corpus. revision: yes

  4. Referee: [Adaptive Policy Learning] Adaptive Policy Learning: The RL component is introduced without specifying the state/action representation, reward function, learning algorithm, or convergence criteria, preventing assessment of whether the claimed cost reduction stems from policy learning or other factors.

    Authors: We recognize that the RL mechanism requires precise specification. The revision will detail the state representation (semantic vectors from the Oracle), action space (exploitation primitives), reward function (success bonus minus per-step cost), learning algorithm, and convergence criteria. These additions will clarify the source of the observed cost reduction. revision: yes
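The reward structure the response describes (success bonus minus per-step cost) can be written down directly. The numeric values below are assumptions for illustration, not the paper's.

```python
# Hedged sketch of the episode return under the described reward shape:
# a terminal success bonus minus a fixed cost per exploitation step.
# SUCCESS_BONUS and STEP_COST are invented values.
SUCCESS_BONUS = 10.0
STEP_COST = 0.5

def episode_return(num_steps: int, succeeded: bool) -> float:
    return (SUCCESS_BONUS if succeeded else 0.0) - STEP_COST * num_steps

# A policy that succeeds in fewer steps earns a strictly higher return,
# which is exactly the pressure that should reduce failed attempts.
assert episode_return(4, True) > episode_return(12, True)
assert episode_return(6, False) < 0  # failures are penalized
```

Under this shape, any claimed cost reduction should be traceable to shorter successful episodes, which is a checkable prediction for the promised ablation.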

Circularity Check

0 steps flagged

No circularity in PoC-Adapt derivation chain

full rationale

The paper defines the Semantic Oracle via pre/post state comparison and the Adaptive Policy via RL over semantic states/actions as independent architectural components. Reported gains (25% reliability, 12/80 CVE confirmations, $0.42 cost) are presented as empirical outcomes on external benchmarks (CWE-Bench-Java, PrimeVul, CVE corpus) rather than reductions to fitted inputs or self-referential definitions. No equations, self-citations, or uniqueness theorems are invoked that collapse the central claims back to their own inputs by construction. The derivation remains self-contained against observable execution states.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on two newly introduced components whose effectiveness is asserted but not independently evidenced in the abstract. Standard assumptions about LLM capabilities in code-related tasks are implicit. No numeric free parameters are identifiable from the given text.

axioms (1)
  • domain assumption Large language models can be orchestrated in multi-agent pipelines to perform root cause analysis, environment setup, and exploit code generation from vulnerability reports
    This is presupposed by the entire multi-agent architecture described in the abstract.
invented entities (2)
  • Semantic Oracle no independent evidence
    purpose: Validates whether an exploit truly triggered the vulnerability by comparing structured pre- and post-execution system states
    Core novel component introduced to replace surface-level execution signals; no prior literature or external validation cited for this exact mechanism.
  • Adaptive Policy Learning mechanism no independent evidence
    purpose: Learns an exploitation policy over semantic states and actions to reduce failed attempts during exploit generation
    Introduced to lower operational cost via reinforcement learning; details of state representation, reward function, and training are absent from the abstract.

pith-pipeline@v0.9.0 · 5588 in / 1579 out tokens · 82387 ms · 2026-05-10T18:16:41.797431+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 14 canonical work pages · 2 internal anchors

  1. [1] Éireann Leverett. 2025 Q4 vulnerability publication forecast. FIRST Blog, October 2025. Accessed: 2 December 2025.
  2. [2] Verizon. 2025 Data Breach Investigations Report (DBIR). Verizon Business, 2025.
  3. [3] Teresa Nicole Brooks. Survey of automated vulnerability detection and exploit generation techniques in cyber reasoning systems. In Science and Information Conference, pages 1083–1102. Springer, 2018.
  4. [4] Alexey V. Vishnyakov and Alexey R. Nurmukhametov. Survey of methods for automated code-reuse exploit generation. Programming and Computer Software, 47(4):271–297, 2021.
  5. [5] Gaetano Sapia and Marcel Böhme. Scaling security testing by addressing the reachability gap. In International Conference on Software Engineering (ICSE), 2026.
  6. [6] Mengyao Zhao, Kaixuan Li, Lyuye Zhang, Wenjing Dang, Chenggong Ding, Sen Chen, and Zheli Liu. A systematic study on generating web vulnerability proof-of-concepts using large language models. arXiv preprint arXiv:2510.10148, 2025.
  7. [7] David Brumley, Pongsin Poosankam, Dawn Song, and Jiang Zheng. Automatic patch-based exploit generation is possible: Techniques and implications. In Proceedings of the IEEE Symposium on Security and Privacy (SP), April 2008. Accessed: 16 December 2025.
  8. [8] Thanassis Avgerinos, Sang Kil Cha, Brent Lim Tze Hao, and David Brumley. AEG: Automatic exploit generation. In Proceedings of the Network and Distributed System Security Symposium (NDSS), February.
  9. [9] Accessed: 16 December 2025.
  10. [10] Abeer Alhuzali, Birhanu Eshete, Rigel Gjomemo, and V. N. Venkatakrishnan. Chainsaw: Chained automated workflow-based exploit generation. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 641–652, 2016.
  11. [11] Xiang Mei, Pulkit Singh Singaria, Jordi Del Castillo, Haoran Xi, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili, Adam Doupé, Hammond Pearce, Brendan Dolan-Gavitt, et al. ARVO: Atlas of reproducible vulnerabilities for open source software. arXiv preprint arXiv:2408.02153, 2024.
  12. [12] Deniz Simsek, Aryaz Eghbali, and Michael Pradel. PoCGen: Generating proof-of-concept exploits for vulnerabilities in npm packages. arXiv preprint arXiv:2506.04962, 2025.
  13. [13] Juefei Pu, Xingyu Li, Haonan Li, Zhengchuan Liang, Jonathan Cox, Yifan Wu, Kareem Shehada, Arrdya Srivastav, and Zhiyun Qian. Patch-to-PoC: A systematic study of agentic LLM systems for Linux kernel n-day reproduction. arXiv preprint arXiv:2602.07287, 2026.
  14. [14] Vivi Andersson, Sofia Bobadilla, Harald Hobbelhagen, and Martin Monperrus. PoCo: Agentic proof-of-concept exploit generation for smart contracts. arXiv preprint arXiv:2511.02780, 2025.
  15. [15] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, Philadelphia, PA, August 2024. USENIX Association.
  16. [16] Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, and Phan The Duy. xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi-agent systems. arXiv preprint arXiv:2509.13021, 2025.
  17. [17] Gelei Deng, Yi Liu, Yuekang Li, Ruozhao Yang, Xiaofei Xie, Jie Zhang, Han Qiu, and Tianwei Zhang. What makes a good LLM agent for real-world penetration testing? arXiv preprint arXiv:2602.17622, 2026.
  18. [18] Wenhao Wang, Hao Gu, Zhixuan Wu, Hao Chen, Xingguo Chen, and Fan Shi. PTFusion: LLM-driven context-aware knowledge fusion for web penetration testing. Information Fusion, page 103731, 2025.
  19. [19] Vikram Nitin, Baishakhi Ray, and Roshanak Zilouchian Moghaddam. FaultLine: Automated proof-of-vulnerability generation using LLM agents. arXiv preprint, 2025.
  20. [20] Saad Ullah, Praneeth Balasubramanian, Wenbo Guo, Amanda Burnett, Hammond Pearce, Christopher Kruegel, Giovanni Vigna, and Gianluca Stringhini. From CVE entries to verifiable exploits: An automated multi-agent framework for reproducing CVEs. arXiv preprint, 2025.
  21. [21] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
  22. [22] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023.
  23. [23] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.
  24. [24] Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366.
  25. [25] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: An LLM-empowered automatic penetration testing tool. arXiv preprint arXiv:2308.06782, 2023.
  26. [26] Ziliang Wang, Ge Li, Jia Li, Hao Zhu, and Zhi Jin. VulAgent: Hypothesis-validation based multi-agent vulnerability detection. arXiv preprint arXiv:2509.11523, 2025.
  27. [27] Longfei Chen, Ruibin Yan, Taiyu Wong, Yiyang Chen, and Chao Zhang. SmartPoC: Generating executable and validated PoCs for smart contract bug reports. arXiv preprint arXiv:2511.12993, 2025.
  28. [28] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
  29. [29] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint, 2020.
  30. [30] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), 2016. Accessed: 21 December 2025.
  31. [31] Wei You, Peiyuan Zong, Kai Chen, XiaoFeng Wang, Xiaojing Liao, Pan Bian, and Bin Liang. SemFuzz: Semantics-based automatic generation of proof-of-concept exploits. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17), pages 2139–2154, October 2017. Accessed: 16 December 2025.
  32. [32] Jie Liu, Hang An, Jin Li, and Hongliang Liang. DEPA: Determining exploit primitives automatically for interactive programs. In Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering, pages 690–694, 2022.
  33. [33] Hong Hu, Zheng Leong Chua, Sendroiu Adrian, Prateek Saxena, and Zhenkai Liang. Automatic generation of data-oriented exploits. In 24th USENIX Security Symposium (USENIX Security 15), pages 177–192, 2015.
  34. [34] Nicolas Bailluet, Emmanuel Fleury, Isabelle Puaut, and Erven Rohou. Nothing is unreachable: Automated synthesis of robust code-reuse gadget chains for arbitrary exploitation primitives. In 34th USENIX Security Symposium (USENIX Security 25), pages 625–643, 2025.
  35. [35] Neil P. Thimmaiah, Yashashvi J. Dave, Rigel Gjomemo, and V. N. Venkatakrishnan. FIXX: Finding eXploits from eXamples. In 34th USENIX Security Symposium (USENIX Security 25), pages 8313–8327, 2025.
  36. [36] Yifan Wu, Yinshuai Li, Hong Zhu, and Yinqian Zhang. SAEG: Stateful automatic exploit generation. In European Symposium on Research in Computer Security, pages 127–145. Springer, 2024.
  37. [37] Qintao Shen, Guozhu Meng, and Kai Chen. Revealing the exploitability of heap overflow through PoC analysis. Cybersecurity, 7(1):47, 2024.
  38. [38] Quang-Cuong Bui, Emanuele Iannone, Maria Camporese, Torge Hinrichs, Catherine Tony, László Tóth, Fabio Palomba, Péter Hegedűs, Fabio Massacci, and Riccardo Scandariato. A systematic literature review on automated exploit and security test generation. arXiv preprint arXiv:2502.04953, 2025.
  39. [39] Francesco Caturano, Jordan Ciotola, Simon Pietro Romano, and Mario Varlese. A chit-chat between LLaMA 2 and ChatGPT for the automated creation of exploits. Computer Networks, 270:111501, 2025.
  40. [40] Qingyu Qu, Peiyu Liu, Hanshu Zhao, Qiang Yang, Adil Israr, and Wei Ruan. DeepAttacker: Multi-agents collaboration based breach and attack simulation. In 2025 2nd International Symposium on AI and Cybersecurity (ISAICS), pages 1–5. IEEE, 2025.
  41. [41] Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent LLM agents. arXiv preprint arXiv:2306.03314, 2023.
  42. [42] Yibin Chen, Linyi Zhang, Yizheng Li, James Zou, Percy Li, and Chelsea Finn. IRIS: LLM-assisted static analysis for detecting security vulnerabilities. arXiv preprint, 2024.
  43. [43] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we? arXiv preprint arXiv:2403.18624, 2024.
  44. [44] GitHub. GitHub Advisory Database. GitHub Security Advisories, 2025.