Recognition: 2 theorem links
PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy
Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3
The pith
PoC-Adapt uses semantic state comparison and adaptive policies to boost PoC exploit verification reliability by 25%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
At its core, PoC-Adapt employs a Semantic Oracle for validation through pre- and post-execution state comparisons and an Adaptive Policy Learning mechanism based on reinforcement learning to guide exploit generation. Implemented with specialized agents for root cause analysis, environment building, exploit generation, and semantic validation, the system demonstrates a 25% improvement in verification reliability on benchmarks and successfully generates verified PoCs for recent CVEs at an average cost of $0.42 per attempt.
What carries the argument
The Semantic Oracle validates exploits by comparing structured pre- and post-execution system states to distinguish true exploitation from incidental changes, supported by Adaptive Policy Learning that learns an exploitation policy over semantic states and actions to minimize failed attempts.
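The state-comparison idea can be sketched concretely. The snippet below is a minimal, hypothetical illustration (the paper does not publish its oracle's schema): it captures a structured snapshot before and after running a PoC, diffs the two, and accepts only if the diff contains an expected exploitation signature, so incidental changes such as log writes are ignored. All field names and signatures here are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical sketch of the Semantic Oracle idea: structured pre/post
# snapshots, a field-by-field diff, and a verdict that requires the
# expected semantic change rather than any observable change.

@dataclass(frozen=True)
class StateSnapshot:
    open_files: frozenset
    processes: frozenset
    file_hashes: dict  # path -> content hash

def state_diff(pre: StateSnapshot, post: StateSnapshot) -> dict:
    return {
        "files_opened": post.open_files - pre.open_files,
        "procs_spawned": post.processes - pre.processes,
        "files_modified": {
            p for p, h in post.file_hashes.items()
            if pre.file_hashes.get(p) != h
        },
    }

def oracle_verdict(diff: dict, expected: dict) -> bool:
    # True exploitation only if every expected semantic change occurred;
    # incidental changes outside `expected` are ignored.
    return all(expected[k] <= diff.get(k, set()) for k in expected)

pre = StateSnapshot(frozenset({"app.log"}), frozenset({"victim"}),
                    {"/etc/flag": "h0"})
post = StateSnapshot(frozenset({"app.log", "/etc/passwd"}),
                     frozenset({"victim", "sh"}), {"/etc/flag": "h1"})
diff = state_diff(pre, post)
print(oracle_verdict(diff, {"procs_spawned": {"sh"}}))        # True: shell spawned
print(oracle_verdict(diff, {"procs_spawned": {"calc.exe"}}))  # False: wrong signature
```

The point of the design is the `expected` signature: a surface-level check ("did anything change?") would count the log write as success, while the semantic check does not.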
If this is right
- Verification reliability improves by 25% over prior LLM-based PoC systems on benchmarks.
- Exploit generation cost decreases through the use of learned adaptive policies.
- The framework can process recent CVE reports and confirm a subset of exploits at low per-unit cost.
- Multi-agent coordination with feedback loops enables end-to-end automation of vulnerability reproduction.
Where Pith is reading between the lines
- Extending semantic validation to other automated security tasks could improve accuracy in areas like malware analysis or intrusion detection.
- The approach might allow for more efficient use of LLM resources in large-scale vulnerability scanning efforts.
- If the policy learning generalizes well, it could adapt to new vulnerability types without extensive retraining.
Load-bearing premise
Comparing structured pre- and post-execution system states reliably distinguishes true vulnerability exploitation from incidental behavioral changes in diverse programs and environments.
What would settle it
Observing an instance where PoC-Adapt reports a successful exploit validation via state changes but independent analysis confirms the reported vulnerability was not triggered by the PoC.
Original abstract
While recent approaches leverage large language models (LLMs) and multi-agent pipelines to automatically generate proof-of-concept (PoC) exploits from vulnerability reports, existing systems often suffer from two fundamental limitations: unreliable validation based on surface-level execution signals and high operational cost caused by extensive trial-and-error during exploit generation. In this paper, we present PoC-Adapt, an end-to-end framework for automated PoC generation and verification, architected upon a foundation of semantic runtime validation and adaptive policy learning. At the core of PoC-Adapt is a Semantic Oracle that validates exploits by comparing structured pre- and post-execution system states, enabling reliable distinction between true vulnerability exploitation and incidental behavioral changes. To reduce exploration cost, we further introduce an Adaptive Policy Learning mechanism that learns an exploitation policy over semantic states and actions, guiding the exploit agent toward effective strategies with fewer failed attempts. PoC-Adapt is implemented as a multi-agent system comprising specialized agents for root cause analysis, environment building, exploit generation, and semantic validation, coordinated through structured feedback loops. Experiments on the CWE-Bench-Java and PrimeVul benchmarks show that PoC-Adapt significantly improves verification reliability by 25% and reduces exploit generation cost compared to prior LLM-based systems, highlighting the importance of semantic validation and learned action policies in automated vulnerability reproduction. Applied to the latest CVE corpus, PoC-Adapt confirmed 12 verified PoCs out of 80 reproduction attempts at a cost of $0.42 per generated exploit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PoC-Adapt, a multi-agent LLM framework for automated PoC exploit generation and verification. It centers on a Semantic Oracle that validates exploits via structured pre- and post-execution system state comparisons and an Adaptive Policy Learning mechanism (RL-driven) that learns exploitation policies to reduce trial-and-error. The system comprises agents for root cause analysis, environment building, exploit generation, and semantic validation. On CWE-Bench-Java and PrimeVul, it claims a 25% reliability improvement and lower costs versus prior LLM systems; on 80 recent CVEs it reports 12 verified PoCs at $0.42 per exploit.
Significance. If the evaluation claims hold under rigorous protocols, the work would advance automated vulnerability reproduction by replacing surface-level signals with semantic state validation and by learning cost-reducing policies. This directly targets two persistent bottlenecks in LLM-based exploit generation and could improve reproducibility in security research and red-teaming pipelines.
major comments (4)
- [Abstract / Evaluation] Abstract and Evaluation section: The 25% reliability improvement is asserted without defining the reliability metric, the precise baseline systems (including their configurations), statistical tests, error bars, or number of runs. This makes the central performance claim impossible to verify or reproduce.
- [Semantic Oracle / Methodology] Semantic Oracle description: The oracle is defined only at a high level via multi-agent feedback loops; no formal specification of captured state elements (heap objects, file descriptors, registers, etc.), comparison procedure, or handling of non-determinism is provided. No ablation isolates oracle accuracy from the adaptive policy, and no experiments on programs with logging, GC, or scheduler noise are reported.
- [CVE Evaluation] CVE corpus experiment: The 12/80 confirmation rate and $0.42 cost figure lack details on CVE selection criteria, definition of 'verified PoC', environment standardization, or direct comparison against the same baselines used in the benchmark experiments.
- [Adaptive Policy Learning] Adaptive Policy Learning: The RL component is introduced without specifying the state/action representation, reward function, learning algorithm, or convergence criteria, preventing assessment of whether the claimed cost reduction stems from policy learning or other factors.
minor comments (1)
- [Abstract / Introduction] The abstract and introduction repeat the same high-level claims without adding concrete implementation or evaluation details; a dedicated 'Limitations' or 'Threats to Validity' subsection would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify important areas where additional rigor and specification will strengthen the manuscript. We address each major comment below and will revise the paper to incorporate the requested clarifications, definitions, and experiments.
Point-by-point responses
Referee: [Abstract / Evaluation] Abstract and Evaluation section: The 25% reliability improvement is asserted without defining the reliability metric, the precise baseline systems (including their configurations), statistical tests, error bars, or number of runs. This makes the central performance claim impossible to verify or reproduce.
Authors: We agree that the reliability metric and supporting evaluation details must be defined explicitly for reproducibility. In the revised manuscript we will define reliability as the fraction of generation attempts that produce a PoC passing semantic validation. We will name the exact baseline systems and their configurations, report results over multiple independent runs with error bars, and include appropriate statistical tests (e.g., paired t-tests). These additions will appear in the Evaluation section and be summarized in the Abstract. revision: yes
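The promised reliability definition is easy to make operational. The sketch below assumes the rebuttal's definition (fraction of attempts whose PoC passes semantic validation) and uses invented per-run numbers, chosen only to illustrate how a 25% relative improvement with error bars would be reported; they are not the paper's data.

```python
import statistics

# Illustrative reliability metric: fraction of generation attempts
# whose PoC passes semantic validation, aggregated over repeated runs.

def reliability(outcomes):
    """outcomes: list of booleans, one per generation attempt."""
    return sum(outcomes) / len(outcomes)

# Five hypothetical independent runs (per-run reliability) per system.
poc_adapt_runs = [0.60, 0.62, 0.64, 0.66, 0.68]
baseline_runs  = [0.50, 0.51, 0.52, 0.51, 0.52]

def mean_and_stderr(xs):
    m = statistics.mean(xs)
    se = statistics.stdev(xs) / len(xs) ** 0.5
    return m, se

m_a, se_a = mean_and_stderr(poc_adapt_runs)
m_b, se_b = mean_and_stderr(baseline_runs)
print(f"PoC-Adapt: {m_a:.3f} ± {se_a:.3f}")
print(f"baseline:  {m_b:.3f} ± {se_b:.3f}")
print(f"relative improvement: {(m_a - m_b) / m_b:.0%}")  # 25%
```

A paired test (e.g. `scipy.stats.ttest_rel` over matched runs) would then decide whether the gap is statistically significant, as the referee requests.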
Referee: [Semantic Oracle / Methodology] Semantic Oracle description: The oracle is defined only at a high level via multi-agent feedback loops; no formal specification of captured state elements (heap objects, file descriptors, registers, etc.), comparison procedure, or handling of non-determinism is provided. No ablation isolates oracle accuracy from the adaptive policy, and no experiments on programs with logging, GC, or scheduler noise are reported.
Authors: We acknowledge that the current description of the Semantic Oracle is high-level. The revision will supply a formal specification of the captured state elements (heap objects, file descriptors, registers, and other runtime artifacts), the structured comparison procedure, and the method for handling non-determinism (repeated executions with threshold-based matching). We will also add an ablation isolating the oracle from the adaptive policy and new experiments evaluating robustness under logging, garbage-collection, and scheduler noise. revision: yes
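The "repeated executions with threshold-based matching" promised here has a simple core, sketched below under our own assumptions (the paper gives no algorithm): run the PoC several times, and keep only state changes that recur in at least a threshold fraction of runs, so noise from logging, GC, or scheduling drops out.

```python
from collections import Counter

# Hypothetical threshold-based matching: a state change counts as
# exploitation-relevant only if it recurs across repeated executions.

def stable_changes(runs, threshold=0.8):
    """runs: list of sets of observed state-change keys, one set per run."""
    counts = Counter(change for run in runs for change in run)
    needed = threshold * len(runs)
    return {change for change, n in counts.items() if n >= needed}

runs = [
    {"proc:sh", "file:app.log"},   # log write is incidental noise
    {"proc:sh"},
    {"proc:sh", "file:tmp1234"},   # one-off temp file
    {"proc:sh", "file:app.log"},
    {"proc:sh"},
]
print(stable_changes(runs))  # only the shell spawn survives: {'proc:sh'}
```

The threshold trades false positives against false negatives: lowering it readmits intermittent noise, raising it risks discarding a flaky but genuine exploitation signal.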
Referee: [CVE Evaluation] CVE corpus experiment: The 12/80 confirmation rate and $0.42 cost figure lack details on CVE selection criteria, definition of 'verified PoC', environment standardization, or direct comparison against the same baselines used in the benchmark experiments.
Authors: We agree these experimental details are necessary. The revised manuscript will state the CVE selection criteria (recent CVEs with public reports and reproducible environments), define a verified PoC as one that passes the Semantic Oracle and produces the intended effect, describe the standardized containerized environment, and present direct head-to-head comparisons against the benchmark baselines on the CVE corpus. revision: yes
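The reported CVE-corpus figures can be sanity-checked with simple arithmetic. The abstract's phrasing ("$0.42 per generated exploit" over 80 attempts) is ambiguous between per-attempt and per-verified-PoC cost; the sketch below adopts the per-attempt reading, so the derived totals are our interpretation, not reported values.

```python
# Back-of-envelope check of the reported CVE-corpus numbers,
# assuming $0.42 is the cost of each of the 80 attempts.
attempts = 80
verified = 12
cost_per_attempt = 0.42  # USD

confirmation_rate = verified / attempts
total_cost = attempts * cost_per_attempt
cost_per_verified_poc = total_cost / verified

print(f"confirmation rate: {confirmation_rate:.1%}")           # 15.0%
print(f"total cost: ${total_cost:.2f}")                        # $33.60
print(f"cost per verified PoC: ${cost_per_verified_poc:.2f}")  # $2.80
```

Even under this reading, the per-verified-PoC cost stays under $3, which is why pinning down the cost denominator matters for the paper's efficiency claim.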
Referee: [Adaptive Policy Learning] Adaptive Policy Learning: The RL component is introduced without specifying the state/action representation, reward function, learning algorithm, or convergence criteria, preventing assessment of whether the claimed cost reduction stems from policy learning or other factors.
Authors: We recognize that the RL mechanism requires precise specification. The revision will detail the state representation (semantic vectors from the Oracle), action space (exploitation primitives), reward function (success bonus minus per-step cost), learning algorithm, and convergence criteria. These additions will clarify the source of the observed cost reduction. revision: yes
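The reward shape promised here (success bonus minus per-step cost) pairs naturally with the double Q-learning family the paper's DDQN belongs to. The sketch below is a minimal tabular stand-in, not the paper's offline DDQN; all state/action names and constants are illustrative.

```python
# Hypothetical reward and a single tabular double Q-learning update,
# standing in for the paper's offline DDQN over semantic states/actions.

STEP_COST, SUCCESS_BONUS = 0.1, 10.0

def reward(exploited: bool) -> float:
    return (SUCCESS_BONUS if exploited else 0.0) - STEP_COST

def double_q_update(qa, qb, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # Double Q-learning: select the next action with one table,
    # evaluate it with the other, to reduce overestimation bias.
    best_next = max(actions, key=lambda x: qa[(s_next, x)])
    target = r + gamma * qb[(s_next, best_next)]
    qa[(s, a)] += alpha * (target - qa[(s, a)])

actions = ["fuzz_input", "craft_payload"]
qa = {(s, a): 0.0 for s in ["probing", "done"] for a in actions}
qb = dict(qa)

# One logged transition: crafting the payload succeeded.
double_q_update(qa, qb, "probing", "craft_payload", reward(True), "done", actions)
print(qa[("probing", "craft_payload")])  # 0.5 * (9.9 + 0) = 4.95
```

With this shape, every extra step strictly lowers return, so a converged policy minimizes failed attempts exactly as the cost-reduction claim requires; the referee's point is that the paper must state which such shape it actually used.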
Circularity Check
No circularity in PoC-Adapt derivation chain
Full rationale
The paper defines the Semantic Oracle via pre/post state comparison and the Adaptive Policy via RL over semantic states/actions as independent architectural components. Reported gains (25% reliability, 12/80 CVE confirmations, $0.42 cost) are presented as empirical outcomes on external benchmarks (CWE-Bench-Java, PrimeVul, CVE corpus) rather than reductions to fitted inputs or self-referential definitions. No equations, self-citations, or uniqueness theorems are invoked that collapse the central claims back to their own inputs by construction. The derivation remains self-contained against observable execution states.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can be orchestrated in multi-agent pipelines to perform root cause analysis, environment setup, and exploit code generation from vulnerability reports.
invented entities (2)
- Semantic Oracle: no independent evidence
- Adaptive Policy Learning mechanism: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel): unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Cited passage: "At the core of PoC-Adapt is a Semantic Oracle that validates exploits by comparing structured pre- and post-execution system states... Adaptive Policy Learning mechanism that learns an exploitation policy over semantic states and actions... DDQN"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean (reality_from_one_distinction): unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Cited passage: "formulates exploit generation as a MDP and trains a DDQN offline using exploitation logs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Éireann Leverett. 2025 Q4 vulnerability publication forecast. FIRST Blog, October 2025. Accessed: 2 December 2025.
- [2] Verizon. 2025 Data Breach Investigations Report (DBIR). Verizon Business, 2025.
- [3] Teresa Nicole Brooks. Survey of automated vulnerability detection and exploit generation techniques in cyber reasoning systems. In Science and Information Conference, pages 1083–1102. Springer, 2018.
- [4] Alexey V Vishnyakov and Alexey R Nurmukhametov. Survey of methods for automated code-reuse exploit generation. Programming and Computer Software, 47(4):271–297, 2021.
- [5] Gaetano Sapia and Marcel Böhme. Scaling security testing by addressing the reachability gap. In International Conference on Software Engineering (ICSE), 2026.
- [6] Mengyao Zhao, Kaixuan Li, Lyuye Zhang, Wenjing Dang, Chenggong Ding, Sen Chen, and Zheli Liu. A systematic study on generating web vulnerability proof-of-concepts using large language models. arXiv preprint arXiv:2510.10148, 2025.
- [7] David Brumley, Pongsin Poosankam, Dawn Song, and Jiang Zheng. Automatic patch-based exploit generation is possible: Techniques and implications. In Proceedings of the IEEE Symposium on Security and Privacy (SP), April 2008. Accessed: 16 December 2025.
- [8] Thanassis Avgerinos, Sang Kil Cha, Brent Lim Tze Hao, and David Brumley. AEG: Automatic exploit generation. In Proceedings of the Network and Distributed System Security Symposium (NDSS), February. Accessed: 16 December 2025.
- [9]
- [10] Abeer Alhuzali, Birhanu Eshete, Rigel Gjomemo, and VN Venkatakrishnan. Chainsaw: Chained automated workflow-based exploit generation. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 641–652, 2016.
- [11] Xiang Mei, Pulkit Singh Singaria, Jordi Del Castillo, Haoran Xi, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili, Adam Doupé, Hammond Pearce, Brendan Dolan-Gavitt, et al. ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software. arXiv preprint arXiv:2408.02153, 2024.
- [12]
- [13] Juefei Pu, Xingyu Li, Haonan Li, Zhengchuan Liang, Jonathan Cox, Yifan Wu, Kareem Shehada, Arrdya Srivastav, and Zhiyun Qian. Patch-to-PoC: A systematic study of agentic LLM systems for Linux kernel n-day reproduction. arXiv preprint arXiv:2602.07287, 2026.
- [14] Vivi Andersson, Sofia Bobadilla, Harald Hobbelhagen, and Martin Monperrus. PoCo: Agentic proof-of-concept exploit generation for smart contracts. arXiv preprint arXiv:2511.02780, 2025.
- [15] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, Philadelphia, PA, August 2024. USENIX Association.
- [16] Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, and Phan The Duy. xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi-agent systems. arXiv preprint arXiv:2509.13021, 2025.
- [17] Gelei Deng, Yi Liu, Yuekang Li, Ruozhao Yang, Xiaofei Xie, Jie Zhang, Han Qiu, and Tianwei Zhang. What makes a good LLM agent for real-world penetration testing? arXiv preprint arXiv:2602.17622, 2026.
- [18] Wenhao Wang, Hao Gu, Zhixuan Wu, Hao Chen, Xingguo Chen, and Fan Shi. PTFusion: LLM-driven context-aware knowledge fusion for web penetration testing. Information Fusion, page 103731, 2025.
- [19] Vikram Nitin, Baishakhi Ray, and Roshanak Zilouchian Moghaddam. FaultLine: Automated proof-of-vulnerability generation using LLM agents. arXiv preprint, 2025.
- [20] Saad Ullah, Praneeth Balasubramanian, Wenbo Guo, Amanda Burnett, Hammond Pearce, Christopher Kruegel, Giovanni Vigna, and Gianluca Stringhini. From CVE entries to verifiable exploits: An automated multi-agent framework for reproducing CVEs. arXiv preprint, 2025.
- [21] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
- [22] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023.
- [23] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.
- [24] Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366.
- [25] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: An LLM-empowered automatic penetration testing tool. arXiv preprint arXiv:2308.06782, 2023.
- [26] Ziliang Wang, Ge Li, Jia Li, Hao Zhu, and Zhi Jin. VulAgent: Hypothesis-validation based multi-agent vulnerability detection. arXiv preprint arXiv:2509.11523, 2025.
- [27] Longfei Chen, Ruibin Yan, Taiyu Wong, Yiyang Chen, and Chao Zhang. SmartPoC: Generating executable and validated PoCs for smart contract bug reports. arXiv preprint arXiv:2511.12993, 2025.
- [28] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- [29] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint, 2020.
- [30] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), 2016. Accessed: 21 December 2025.
- [31] Wei You, Peiyuan Zong, Kai Chen, XiaoFeng Wang, Xiaojing Liao, Pan Bian, and Bin Liang. SemFuzz: Semantics-based automatic generation of proof-of-concept exploits. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17), pages 2139–2154, October 2017. Accessed: 16 December 2025.
- [32] Jie Liu, Hang An, Jin Li, and Hongliang Liang. DEPA: Determining exploit primitives automatically for interactive programs. In Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering, pages 690–694, 2022.
- [33] Hong Hu, Zheng Leong Chua, Sendroiu Adrian, Prateek Saxena, and Zhenkai Liang. Automatic generation of data-oriented exploits. In 24th USENIX Security Symposium (USENIX Security 15), pages 177–192, 2015.
- [34] Nicolas Bailluet, Emmanuel Fleury, Isabelle Puaut, and Erven Rohou. Nothing is unreachable: Automated synthesis of robust code-reuse gadget chains for arbitrary exploitation primitives. In 34th USENIX Security Symposium (USENIX Security 25), pages 625–643, 2025.
- [35] Neil P Thimmaiah, Yashashvi J Dave, Rigel Gjomemo, and VN Venkatakrishnan. FIXX: FInding eXploits from eXamples. In 34th USENIX Security Symposium (USENIX Security 25), pages 8313–8327, 2025.
- [36] Yifan Wu, Yinshuai Li, Hong Zhu, and Yinqian Zhang. SAEG: Stateful automatic exploit generation. In European Symposium on Research in Computer Security, pages 127–145. Springer, 2024.
- [37] Qintao Shen, Guozhu Meng, and Kai Chen. Revealing the exploitability of heap overflow through PoC analysis. Cybersecurity, 7(1):47, 2024.
- [38] Quang-Cuong Bui, Emanuele Iannone, Maria Camporese, Torge Hinrichs, Catherine Tony, László Tóth, Fabio Palomba, Péter Hegedűs, Fabio Massacci, and Riccardo Scandariato. A systematic literature review on automated exploit and security test generation. arXiv preprint arXiv:2502.04953, 2025.
- [39] Francesco Caturano, Jordan Ciotola, Simon Pietro Romano, and Mario Varlese. A chit-chat between Llama 2 and ChatGPT for the automated creation of exploits. Computer Networks, 270:111501, 2025.
- [40] Qingyu Qu, Peiyu Liu, Hanshu Zhao, Qiang Yang, Adil Israr, and Wei Ruan. DeepAttacker: Multi-agents collaboration based breach and attack simulation. In 2025 2nd International Symposium on AI and Cybersecurity (ISAICS), pages 1–5. IEEE, 2025.
- [41] Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent LLM agents. arXiv preprint arXiv:2306.03314, 2023.
- [42] Yibin Chen, Linyi Zhang, Yizheng Li, James Zou, Percy Li, and Chelsea Finn. IRIS: LLM-assisted static analysis for detecting security vulnerabilities. arXiv preprint, 2024.
- [43] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we? arXiv preprint arXiv:2403.18624, 2024.
- [44] GitHub. GitHub Advisory Database. GitHub Security Advisories, 2025.