Recognition: 2 theorem links
PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy
Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3
The pith
PoC-Adapt uses semantic state comparison and adaptive policies to boost PoC exploit verification reliability by 25%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
At its core, PoC-Adapt employs a Semantic Oracle for validation through pre- and post-execution state comparisons and an Adaptive Policy Learning mechanism based on reinforcement learning to guide exploit generation. Implemented with specialized agents for root cause analysis, environment building, exploit generation, and semantic validation, the system demonstrates a 25% improvement in verification reliability on benchmarks and successfully generates verified PoCs for recent CVEs at an average cost of $0.42 per attempt.
What carries the argument
The Semantic Oracle validates exploits by comparing structured pre- and post-execution system states to distinguish true exploitation from incidental changes, supported by Adaptive Policy Learning that learns an exploitation policy over semantic states and actions to minimize failed attempts.
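The state-comparison idea can be sketched concretely. The snippet below is a minimal, hypothetical illustration (the paper does not publish its oracle's schema): it captures a structured snapshot before and after running a PoC, diffs the two, and accepts only if the diff contains an expected exploitation signature, so incidental changes such as log writes are ignored. All field names and signatures here are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical sketch of the Semantic Oracle idea: structured pre/post
# snapshots, a field-by-field diff, and a verdict that requires the
# expected semantic change rather than any observable change.

@dataclass(frozen=True)
class StateSnapshot:
    open_files: frozenset
    processes: frozenset
    file_hashes: dict  # path -> content hash

def state_diff(pre: StateSnapshot, post: StateSnapshot) -> dict:
    return {
        "files_opened": post.open_files - pre.open_files,
        "procs_spawned": post.processes - pre.processes,
        "files_modified": {
            p for p, h in post.file_hashes.items()
            if pre.file_hashes.get(p) != h
        },
    }

def oracle_verdict(diff: dict, expected: dict) -> bool:
    # True exploitation only if every expected semantic change occurred;
    # incidental changes outside `expected` are ignored.
    return all(expected[k] <= diff.get(k, set()) for k in expected)

pre = StateSnapshot(frozenset({"app.log"}), frozenset({"victim"}),
                    {"/etc/flag": "h0"})
post = StateSnapshot(frozenset({"app.log", "/etc/passwd"}),
                     frozenset({"victim", "sh"}), {"/etc/flag": "h1"})
diff = state_diff(pre, post)
print(oracle_verdict(diff, {"procs_spawned": {"sh"}}))        # True: shell spawned
print(oracle_verdict(diff, {"procs_spawned": {"calc.exe"}}))  # False: wrong signature
```

The point of the design is the `expected` signature: a surface-level check ("did anything change?") would count the log write as success, while the semantic check does not.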
If this is right
- Verification reliability improves by 25% over prior LLM-based PoC systems on benchmarks.
- Exploit generation cost decreases through the use of learned adaptive policies.
- The framework can process recent CVE reports and confirm a subset of exploits at low per-unit cost.
- Multi-agent coordination with feedback loops enables end-to-end automation of vulnerability reproduction.
Where Pith is reading between the lines
- Extending semantic validation to other automated security tasks could improve accuracy in areas like malware analysis or intrusion detection.
- The approach might allow for more efficient use of LLM resources in large-scale vulnerability scanning efforts.
- If the policy learning generalizes well, it could adapt to new vulnerability types without extensive retraining.
Load-bearing premise
Comparing structured pre- and post-execution system states reliably distinguishes true vulnerability exploitation from incidental behavioral changes in diverse programs and environments.
What would settle it
Observing an instance where PoC-Adapt reports a successful exploit validation via state changes but independent analysis confirms the reported vulnerability was not triggered by the PoC.
Original abstract
While recent approaches leverage large language models (LLMs) and multi-agent pipelines to automatically generate proof-of-concept (PoC) exploits from vulnerability reports, existing systems often suffer from two fundamental limitations: unreliable validation based on surface-level execution signals and high operational cost caused by extensive trial-and-error during exploit generation. In this paper, we present PoC-Adapt, an end-to-end framework for automated PoC generation and verification, architected upon a foundation of semantic runtime validation and adaptive policy learning. At the core of PoC-Adapt is a Semantic Oracle that validates exploits by comparing structured pre- and post-execution system states, enabling reliable distinction between true vulnerability exploitation and incidental behavioral changes. To reduce exploration cost, we further introduce an Adaptive Policy Learning mechanism that learns an exploitation policy over semantic states and actions, guiding the exploit agent toward effective strategies with fewer failed attempts. PoC-Adapt is implemented as a multi-agent system comprising specialized agents for root cause analysis, environment building, exploit generation, and semantic validation, coordinated through structured feedback loops. Experiments on the CWE-Bench-Java and PrimeVul benchmarks show that PoC-Adapt significantly improves verification reliability by 25% and reduces exploit generation cost compared to prior LLM-based systems, highlighting the importance of semantic validation and learned action policies in automated vulnerability reproduction. Applied to the latest CVE corpus, PoC-Adapt confirmed 12 verified PoCs out of 80 reproduction attempts at a cost of $0.42 per generated exploit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PoC-Adapt, a multi-agent LLM framework for automated PoC exploit generation and verification. It centers on a Semantic Oracle that validates exploits via structured pre- and post-execution system state comparisons and an Adaptive Policy Learning mechanism (RL-driven) that learns exploitation policies to reduce trial-and-error. The system comprises agents for root cause analysis, environment building, exploit generation, and semantic validation. On CWE-Bench-Java and PrimeVul, it claims a 25% reliability improvement and lower costs versus prior LLM systems; on 80 recent CVEs it reports 12 verified PoCs at $0.42 per exploit.
Significance. If the evaluation claims hold under rigorous protocols, the work would advance automated vulnerability reproduction by replacing surface-level signals with semantic state validation and by learning cost-reducing policies. This directly targets two persistent bottlenecks in LLM-based exploit generation and could improve reproducibility in security research and red-teaming pipelines.
major comments (4)
- [Abstract / Evaluation] Abstract and Evaluation section: The 25% reliability improvement is asserted without defining the reliability metric, the precise baseline systems (including their configurations), statistical tests, error bars, or number of runs. This makes the central performance claim impossible to verify or reproduce.
- [Semantic Oracle / Methodology] Semantic Oracle description: The oracle is defined only at a high level via multi-agent feedback loops; no formal specification of captured state elements (heap objects, file descriptors, registers, etc.), comparison procedure, or handling of non-determinism is provided. No ablation isolates oracle accuracy from the adaptive policy, and no experiments on programs with logging, GC, or scheduler noise are reported.
- [CVE Evaluation] CVE corpus experiment: The 12/80 confirmation rate and $0.42 cost figure lack details on CVE selection criteria, definition of 'verified PoC', environment standardization, or direct comparison against the same baselines used in the benchmark experiments.
- [Adaptive Policy Learning] Adaptive Policy Learning: The RL component is introduced without specifying the state/action representation, reward function, learning algorithm, or convergence criteria, preventing assessment of whether the claimed cost reduction stems from policy learning or other factors.
minor comments (1)
- [Abstract / Introduction] The abstract and introduction repeat the same high-level claims without adding concrete implementation or evaluation details; a dedicated 'Limitations' or 'Threats to Validity' subsection would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify important areas where additional rigor and specification will strengthen the manuscript. We address each major comment below and will revise the paper to incorporate the requested clarifications, definitions, and experiments.
Point-by-point responses
Referee: [Abstract / Evaluation] Abstract and Evaluation section: The 25% reliability improvement is asserted without defining the reliability metric, the precise baseline systems (including their configurations), statistical tests, error bars, or number of runs. This makes the central performance claim impossible to verify or reproduce.
Authors: We agree that the reliability metric and supporting evaluation details must be defined explicitly for reproducibility. In the revised manuscript we will define reliability as the fraction of generation attempts that produce a PoC passing semantic validation. We will name the exact baseline systems and their configurations, report results over multiple independent runs with error bars, and include appropriate statistical tests (e.g., paired t-tests). These additions will appear in the Evaluation section and be summarized in the Abstract. revision: yes
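The promised reliability definition is easy to make operational. The sketch below assumes the rebuttal's definition (fraction of attempts whose PoC passes semantic validation) and uses invented per-run numbers, chosen only to illustrate how a 25% relative improvement with error bars would be reported; they are not the paper's data.

```python
import statistics

# Illustrative reliability metric: fraction of generation attempts
# whose PoC passes semantic validation, aggregated over repeated runs.

def reliability(outcomes):
    """outcomes: list of booleans, one per generation attempt."""
    return sum(outcomes) / len(outcomes)

# Five hypothetical independent runs (per-run reliability) per system.
poc_adapt_runs = [0.60, 0.62, 0.64, 0.66, 0.68]
baseline_runs  = [0.50, 0.51, 0.52, 0.51, 0.52]

def mean_and_stderr(xs):
    m = statistics.mean(xs)
    se = statistics.stdev(xs) / len(xs) ** 0.5
    return m, se

m_a, se_a = mean_and_stderr(poc_adapt_runs)
m_b, se_b = mean_and_stderr(baseline_runs)
print(f"PoC-Adapt: {m_a:.3f} ± {se_a:.3f}")
print(f"baseline:  {m_b:.3f} ± {se_b:.3f}")
print(f"relative improvement: {(m_a - m_b) / m_b:.0%}")  # 25%
```

A paired test (e.g. `scipy.stats.ttest_rel` over matched runs) would then decide whether the gap is statistically significant, as the referee requests.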
Referee: [Semantic Oracle / Methodology] Semantic Oracle description: The oracle is defined only at a high level via multi-agent feedback loops; no formal specification of captured state elements (heap objects, file descriptors, registers, etc.), comparison procedure, or handling of non-determinism is provided. No ablation isolates oracle accuracy from the adaptive policy, and no experiments on programs with logging, GC, or scheduler noise are reported.
Authors: We acknowledge that the current description of the Semantic Oracle is high-level. The revision will supply a formal specification of the captured state elements (heap objects, file descriptors, registers, and other runtime artifacts), the structured comparison procedure, and the method for handling non-determinism (repeated executions with threshold-based matching). We will also add an ablation isolating the oracle from the adaptive policy and new experiments evaluating robustness under logging, garbage-collection, and scheduler noise. revision: yes
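The "repeated executions with threshold-based matching" promised here has a simple core, sketched below under our own assumptions (the paper gives no algorithm): run the PoC several times, and keep only state changes that recur in at least a threshold fraction of runs, so noise from logging, GC, or scheduling drops out.

```python
from collections import Counter

# Hypothetical threshold-based matching: a state change counts as
# exploitation-relevant only if it recurs across repeated executions.

def stable_changes(runs, threshold=0.8):
    """runs: list of sets of observed state-change keys, one set per run."""
    counts = Counter(change for run in runs for change in run)
    needed = threshold * len(runs)
    return {change for change, n in counts.items() if n >= needed}

runs = [
    {"proc:sh", "file:app.log"},   # log write is incidental noise
    {"proc:sh"},
    {"proc:sh", "file:tmp1234"},   # one-off temp file
    {"proc:sh", "file:app.log"},
    {"proc:sh"},
]
print(stable_changes(runs))  # only the shell spawn survives: {'proc:sh'}
```

The threshold trades false positives against false negatives: lowering it readmits intermittent noise, raising it risks discarding a flaky but genuine exploitation signal.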
Referee: [CVE Evaluation] CVE corpus experiment: The 12/80 confirmation rate and $0.42 cost figure lack details on CVE selection criteria, definition of 'verified PoC', environment standardization, or direct comparison against the same baselines used in the benchmark experiments.
Authors: We agree these experimental details are necessary. The revised manuscript will state the CVE selection criteria (recent CVEs with public reports and reproducible environments), define a verified PoC as one that passes the Semantic Oracle and produces the intended effect, describe the standardized containerized environment, and present direct head-to-head comparisons against the benchmark baselines on the CVE corpus. revision: yes
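The reported CVE-corpus figures can be sanity-checked with simple arithmetic. The abstract's phrasing ("$0.42 per generated exploit" over 80 attempts) is ambiguous between per-attempt and per-verified-PoC cost; the sketch below adopts the per-attempt reading, so the derived totals are our interpretation, not reported values.

```python
# Back-of-envelope check of the reported CVE-corpus numbers,
# assuming $0.42 is the cost of each of the 80 attempts.
attempts = 80
verified = 12
cost_per_attempt = 0.42  # USD

confirmation_rate = verified / attempts
total_cost = attempts * cost_per_attempt
cost_per_verified_poc = total_cost / verified

print(f"confirmation rate: {confirmation_rate:.1%}")           # 15.0%
print(f"total cost: ${total_cost:.2f}")                        # $33.60
print(f"cost per verified PoC: ${cost_per_verified_poc:.2f}")  # $2.80
```

Even under this reading, the per-verified-PoC cost stays under $3, which is why pinning down the cost denominator matters for the paper's efficiency claim.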
Referee: [Adaptive Policy Learning] Adaptive Policy Learning: The RL component is introduced without specifying the state/action representation, reward function, learning algorithm, or convergence criteria, preventing assessment of whether the claimed cost reduction stems from policy learning or other factors.
Authors: We recognize that the RL mechanism requires precise specification. The revision will detail the state representation (semantic vectors from the Oracle), action space (exploitation primitives), reward function (success bonus minus per-step cost), learning algorithm, and convergence criteria. These additions will clarify the source of the observed cost reduction. revision: yes
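The reward shape promised here (success bonus minus per-step cost) pairs naturally with the double Q-learning family the paper's DDQN belongs to. The sketch below is a minimal tabular stand-in, not the paper's offline DDQN; all state/action names and constants are illustrative.

```python
# Hypothetical reward and a single tabular double Q-learning update,
# standing in for the paper's offline DDQN over semantic states/actions.

STEP_COST, SUCCESS_BONUS = 0.1, 10.0

def reward(exploited: bool) -> float:
    return (SUCCESS_BONUS if exploited else 0.0) - STEP_COST

def double_q_update(qa, qb, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # Double Q-learning: select the next action with one table,
    # evaluate it with the other, to reduce overestimation bias.
    best_next = max(actions, key=lambda x: qa[(s_next, x)])
    target = r + gamma * qb[(s_next, best_next)]
    qa[(s, a)] += alpha * (target - qa[(s, a)])

actions = ["fuzz_input", "craft_payload"]
qa = {(s, a): 0.0 for s in ["probing", "done"] for a in actions}
qb = dict(qa)

# One logged transition: crafting the payload succeeded.
double_q_update(qa, qb, "probing", "craft_payload", reward(True), "done", actions)
print(qa[("probing", "craft_payload")])  # 0.5 * (9.9 + 0) = 4.95
```

With this shape, every extra step strictly lowers return, so a converged policy minimizes failed attempts exactly as the cost-reduction claim requires; the referee's point is that the paper must state which such shape it actually used.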
Circularity Check
No circularity in PoC-Adapt derivation chain
Full rationale
The paper defines the Semantic Oracle via pre/post state comparison and the Adaptive Policy via RL over semantic states/actions as independent architectural components. Reported gains (25% reliability, 12/80 CVE confirmations, $0.42 cost) are presented as empirical outcomes on external benchmarks (CWE-Bench-Java, PrimeVul, CVE corpus) rather than reductions to fitted inputs or self-referential definitions. No equations, self-citations, or uniqueness theorems are invoked that collapse the central claims back to their own inputs by construction. The derivation remains self-contained against observable execution states.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can be orchestrated in multi-agent pipelines to perform root cause analysis, environment setup, and exploit code generation from vulnerability reports.
invented entities (2)
- Semantic Oracle: no independent evidence
- Adaptive Policy Learning mechanism: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel): unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Cited passage: "At the core of PoC-Adapt is a Semantic Oracle that validates exploits by comparing structured pre- and post-execution system states... Adaptive Policy Learning mechanism that learns an exploitation policy over semantic states and actions... DDQN"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean (reality_from_one_distinction): unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Cited passage: "formulates exploit generation as a MDP and trains a DDQN offline using exploitation logs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Éireann Leverett. 2025 Q4 vulnerability publication forecast. FIRST Blog, October 2025. Accessed: 2 December 2025.
- [2] Verizon. 2025 Data Breach Investigations Report (DBIR). Verizon Business, 2025.
- [3] Teresa Nicole Brooks. Survey of automated vulnerability detection and exploit generation techniques in cyber reasoning systems. In Science and Information Conference, pages 1083–1102. Springer, 2018.
- [4] Alexey V Vishnyakov and Alexey R Nurmukhametov. Survey of methods for automated code-reuse exploit generation. Programming and Computer Software, 47(4):271–297, 2021.
- [5] Gaetano Sapia and Marcel Böhme. Scaling security testing by addressing the reachability gap. In International Conference on Software Engineering (ICSE), 2026.
- [6] Mengyao Zhao, Kaixuan Li, Lyuye Zhang, Wenjing Dang, Chenggong Ding, Sen Chen, and Zheli Liu. A systematic study on generating web vulnerability proof-of-concepts using large language models. arXiv preprint arXiv:2510.10148, 2025.
- [7] David Brumley, Pongsin Poosankam, Dawn Song, and Jiang Zheng. Automatic patch-based exploit generation is possible: Techniques and implications. In Proceedings of the IEEE Symposium on Security and Privacy (SP), April 2008. Accessed: 16 December 2025.
- [8] Thanassis Avgerinos, Sang Kil Cha, Brent Lim Tze Hao, and David Brumley. AEG: Automatic exploit generation. In Proceedings of the Network and Distributed System Security Symposium (NDSS), February. Accessed: 16 December 2025.
- [9]
- [10] Abeer Alhuzali, Birhanu Eshete, Rigel Gjomemo, and VN Venkatakrishnan. Chainsaw: Chained automated workflow-based exploit generation. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 641–652, 2016.
- [11] Xiang Mei, Pulkit Singh Singaria, Jordi Del Castillo, Haoran Xi, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili, Adam Doupé, Hammond Pearce, Brendan Dolan-Gavitt, et al. ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software. arXiv preprint arXiv:2408.02153, 2024.
- [12]
- [13] Juefei Pu, Xingyu Li, Haonan Li, Zhengchuan Liang, Jonathan Cox, Yifan Wu, Kareem Shehada, Arrdya Srivastav, and Zhiyun Qian. Patch-to-PoC: A systematic study of agentic LLM systems for Linux kernel n-day reproduction. arXiv preprint arXiv:2602.07287, 2026.
- [14] Vivi Andersson, Sofia Bobadilla, Harald Hobbelhagen, and Martin Monperrus. PoCo: Agentic proof-of-concept exploit generation for smart contracts. arXiv preprint arXiv:2511.02780, 2025.
- [15] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, Philadelphia, PA, August 2024. USENIX Association.
- [16] Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, and Phan The Duy. xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi-agent systems. arXiv preprint arXiv:2509.13021, 2025.
- [17] Gelei Deng, Yi Liu, Yuekang Li, Ruozhao Yang, Xiaofei Xie, Jie Zhang, Han Qiu, and Tianwei Zhang. What makes a good LLM agent for real-world penetration testing? arXiv preprint arXiv:2602.17622, 2026.
- [18] Wenhao Wang, Hao Gu, Zhixuan Wu, Hao Chen, Xingguo Chen, and Fan Shi. PTFusion: LLM-driven context-aware knowledge fusion for web penetration testing. Information Fusion, page 103731, 2025.
- [19] Vikram Nitin, Baishakhi Ray, and Roshanak Zilouchian Moghaddam. FaultLine: Automated proof-of-vulnerability generation using LLM agents. arXiv preprint, 2025.
- [20] Saad Ullah, Praneeth Balasubramanian, Wenbo Guo, Amanda Burnett, Hammond Pearce, Christopher Kruegel, Giovanni Vigna, and Gianluca Stringhini. From CVE entries to verifiable exploits: An automated multi-agent framework for reproducing CVEs. arXiv preprint, 2025.
- [21] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
- [22] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023.
- [23] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.
- [24] Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366.
- [25] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: An LLM-empowered automatic penetration testing tool. arXiv preprint arXiv:2308.06782, 2023.
- [26] Ziliang Wang, Ge Li, Jia Li, Hao Zhu, and Zhi Jin. VulAgent: Hypothesis-validation based multi-agent vulnerability detection. arXiv preprint arXiv:2509.11523, 2025.
- [27] Longfei Chen, Ruibin Yan, Taiyu Wong, Yiyang Chen, and Chao Zhang. SmartPoC: Generating executable and validated PoCs for smart contract bug reports. arXiv preprint arXiv:2511.12993, 2025.
- [28] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- [29] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint, 2020.
- [30] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), 2016. Accessed: 21 December 2025.
- [31] Wei You, Peiyuan Zong, Kai Chen, XiaoFeng Wang, Xiaojing Liao, Pan Bian, and Bin Liang. SemFuzz: Semantics-based automatic generation of proof-of-concept exploits. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17), pages 2139–2154, October 2017. Accessed: 16 December 2025.
- [32] Jie Liu, Hang An, Jin Li, and Hongliang Liang. DEPA: Determining exploit primitives automatically for interactive programs. In Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering, pages 690–694, 2022.
- [33] Hong Hu, Zheng Leong Chua, Sendroiu Adrian, Prateek Saxena, and Zhenkai Liang. Automatic generation of data-oriented exploits. In 24th USENIX Security Symposium (USENIX Security 15), pages 177–192, 2015.
- [34] Nicolas Bailluet, Emmanuel Fleury, Isabelle Puaut, and Erven Rohou. Nothing is unreachable: Automated synthesis of robust code-reuse gadget chains for arbitrary exploitation primitives. In 34th USENIX Security Symposium (USENIX Security 25), pages 625–643, 2025.
- [35] Neil P Thimmaiah, Yashashvi J Dave, Rigel Gjomemo, and VN Venkatakrishnan. FIXX: FInding eXploits from eXamples. In 34th USENIX Security Symposium (USENIX Security 25), pages 8313–8327, 2025.
- [36] Yifan Wu, Yinshuai Li, Hong Zhu, and Yinqian Zhang. SAEG: Stateful automatic exploit generation. In European Symposium on Research in Computer Security, pages 127–145. Springer, 2024.
- [37] Qintao Shen, Guozhu Meng, and Kai Chen. Revealing the exploitability of heap overflow through PoC analysis. Cybersecurity, 7(1):47, 2024.
- [38] Quang-Cuong Bui, Emanuele Iannone, Maria Camporese, Torge Hinrichs, Catherine Tony, László Tóth, Fabio Palomba, Péter Hegedűs, Fabio Massacci, and Riccardo Scandariato. A systematic literature review on automated exploit and security test generation. arXiv preprint arXiv:2502.04953, 2025.
- [39] Francesco Caturano, Jordan Ciotola, Simon Pietro Romano, and Mario Varlese. A chit-chat between Llama 2 and ChatGPT for the automated creation of exploits. Computer Networks, 270:111501, 2025.
- [40] Qingyu Qu, Peiyu Liu, Hanshu Zhao, Qiang Yang, Adil Israr, and Wei Ruan. DeepAttacker: Multi-agents collaboration based breach and attack simulation. In 2025 2nd International Symposium on AI and Cybersecurity (ISAICS), pages 1–5. IEEE, 2025.
- [41] Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent LLM agents. arXiv preprint arXiv:2306.03314, 2023.
- [42] Yibin Chen, Linyi Zhang, Yizheng Li, James Zou, Percy Li, and Chelsea Finn. IRIS: LLM-assisted static analysis for detecting security vulnerabilities. arXiv preprint, 2024.
- [43] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we? arXiv preprint arXiv:2403.18624, 2024.
- [44] GitHub. GitHub Advisory Database. GitHub Security Advisories, 2025.