
arxiv: 2605.03956 · v1 · submitted 2026-05-05 · 💻 cs.CR · cs.SE

Generating Proof-of-Vulnerability Tests to Help Enhance the Security of Complex Software

Pith reviewed 2026-05-07 15:09 UTC · model grok-4.3

classification 💻 cs.CR cs.SE
keywords proof-of-vulnerability tests · software supply chain security · LLM-based test generation · vulnerable libraries · call path analysis · agent-based test generation · Java applications · executable security tests

The pith

PoVSmith combines call path analysis with LLM prompts and execution feedback to automatically generate proof-of-vulnerability tests for applications using vulnerable libraries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Applications built on third-party libraries can become vulnerable when library flaws are reachable through app code. Developers need concrete, executable proof-of-vulnerability tests to judge real security risk, yet manual creation is difficult and prior automation falls short. PoVSmith feeds call-path details, exemplar tests, and runtime feedback into prompts for a coding agent and a large language model to produce, run, and evaluate such tests. Evaluated on 33 Java app-library pairs with known vulnerabilities, it reported 158 application-level entry points that reach vulnerable library APIs, 152 (96 percent) of them correctly, and generated 152 tests, of which 84 (55 percent) demonstrated feasible attacks. It reduces reliance on human effort while raising test quality over the state-of-the-art LLM-based approach.

Core claim

PoVSmith is a new agent-based approach that integrates static call-path analysis, code context, and iterative execution feedback into multiple prompts to direct a coding agent and large language model through test generation, execution, and quality assessment, yielding executable PoV tests that expose how library vulnerabilities propagate into dependent applications.
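
To make the call-path step concrete, here is a minimal Python sketch, assuming the call graph is already available as an adjacency map. The graph, the method names, and the helper `find_call_paths` are hypothetical illustrations, not the paper's analysis. The sketch searches breadth-first from each application entry point and records a path whenever a visited method calls a vulnerable library API.

```python
from collections import deque


def find_call_paths(call_graph, entry_points, vulnerable_apis, max_len=5):
    """Breadth-first search from each entry point to vulnerable APIs.

    call_graph maps a caller to the list of methods it calls.
    A path's length is its number of call edges, so length 1 means the
    entry point calls the vulnerable API directly.
    """
    paths = []
    for entry in entry_points:
        queue = deque([[entry]])
        seen = {entry}
        while queue:
            path = queue.popleft()
            if len(path) - 1 >= max_len:
                continue  # extending this path would exceed max_len edges
            for callee in call_graph.get(path[-1], []):
                if callee in vulnerable_apis:
                    paths.append(path + [callee])
                elif callee not in seen:
                    seen.add(callee)
                    queue.append(path + [callee])
    return paths


# Hypothetical toy call graph (all names are made up for illustration).
call_graph = {
    "App.parseConfig": ["Lib.readXml"],
    "App.handleUpload": ["App.toDocument"],
    "App.toDocument": ["Lib.readXml"],
    "App.ping": ["App.log"],
}
entry_points = ["App.parseConfig", "App.handleUpload", "App.ping"]
vulnerable_apis = {"Lib.readXml"}

paths = find_call_paths(call_graph, entry_points, vulnerable_apis)
for p in paths:
    print(len(p) - 1, " -> ".join(p))
```

In this toy graph, `App.parseConfig` reaches the vulnerable API directly (length 1), `App.handleUpload` reaches it indirectly (length 2), and `App.ping` never reaches it.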

What carries the argument

The iterative prompting loop that supplies call-path information from application entry points to vulnerable library APIs together with execution logs to guide LLM test creation and refinement.
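
A rough sketch of such a loop follows, under stated assumptions: `generate` stands in for the coding agent, `execute` stands in for running the test, and the prompt layout, the stopping rule, and all names are hypothetical rather than taken from the paper's templates.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    succeeded: bool  # assumed stop signal; the paper's criterion may differ
    log: str


def build_prompt(call_path, exemplar_test, feedback=None):
    """Assemble a prompt from the call path, an exemplar, and optional feedback."""
    parts = [
        "Call path: " + " -> ".join(call_path),
        "Exemplar test:\n" + exemplar_test,
    ]
    if feedback:
        parts.append("Previous attempt failed with log:\n" + feedback)
    return "\n\n".join(parts)


def refine_test(call_path, exemplar_test, generate, execute, max_rounds=3):
    """Generate a test, run it, and feed the log back until it succeeds."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        prompt = build_prompt(call_path, exemplar_test, feedback)
        test_code = generate(prompt)
        result = execute(test_code)
        if result.succeeded:
            return test_code, round_no
        feedback = result.log
    return None, max_rounds


# Hypothetical stand-ins so the sketch runs without any model or build tool.
def fake_generate(prompt):
    return "fixed test" if "Previous attempt" in prompt else "broken test"


def fake_execute(test_code):
    if test_code == "fixed test":
        return RunResult(True, "1 test passed")
    return RunResult(False, "compilation error: missing import")


test_code, rounds = refine_test(
    ["App.handleUpload", "Lib.readXml"],
    "@Test void example() {}",
    fake_generate,
    fake_execute,
)
print(test_code, rounds)
```

Passing the generator and executor in as callables keeps the loop itself independent of any particular model or build system, which is a design choice of this sketch only.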

If this is right

  • Developers receive concrete evidence of supply-chain risks without writing tests themselves.
  • 152 of the 158 application-level entry points that PoVSmith reports (96 percent) are correct, along with their call paths to vulnerable library APIs.
  • 84 of the 152 generated tests (55 percent) succeed in demonstrating feasible attacks on the applications.
  • Human involvement drops while test quality rises compared with prior LLM-based methods.
  • The same prompting structure supports both test creation and automated quality assessment grounded in context and logs.
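
The two headline percentages can be re-derived from the counts given in the abstract. The short Python check below uses only those counts; the variable names are ours. Note that the 96 percent figure is the share of reported entry points that were correct, not the share of all reachable entry points that were found.

```python
# Counts as stated in the paper's abstract.
reported_entry_points = 158
correct_entry_points = 152
generated_tests = 152
feasible_tests = 84

entry_point_precision = correct_entry_points / reported_entry_points
feasible_rate = feasible_tests / generated_tests

print(f"{entry_point_precision:.1%}")  # 96.2%
print(f"{feasible_rate:.1%}")          # 55.3%
```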

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The technique could be extended to languages other than Java by adapting the call-path extractor and runtime instrumentation.
  • Embedding PoVSmith in continuous-integration pipelines would allow automatic flagging of exploitable dependency vulnerabilities before deployment.
  • Higher success rates might follow from richer feedback signals such as coverage metrics or symbolic execution traces.
  • The generated tests could serve as regression oracles for future library updates to confirm that fixes remain effective.

Load-bearing premise

LLM-generated tests guided by call paths and execution feedback reliably indicate real-world attack feasibility without systematic false positives or negatives.

What would settle it

Independent manual verification by security experts showing that a substantial fraction of the 84 tests labeled successful do not actually produce exploitable behavior in the target applications, or that many known feasible attacks are missed.
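
One way to frame such an audit is to compare the assessor's verdicts with expert verdicts test by test. The Python sketch below uses made-up labels for eight hypothetical tests, not data from the paper, and the helper `audit` is our own illustration.

```python
def audit(llm_labels, expert_labels):
    """Compare assessor verdicts with expert verdicts (True = feasible attack)."""
    if len(llm_labels) != len(expert_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(llm_labels, expert_labels))
    tp = sum(1 for l, e in pairs if l and e)          # both say feasible
    fp = sum(1 for l, e in pairs if l and not e)      # assessor over-claims
    fn = sum(1 for l, e in pairs if not l and e)      # assessor misses an attack
    tn = sum(1 for l, e in pairs if not l and not e)  # both say infeasible
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# Made-up verdicts for eight hypothetical tests.
llm_labels = [True, True, True, True, True, False, False, False]
expert_labels = [True, True, True, False, False, True, False, False]

result = audit(llm_labels, expert_labels)
print(result)
```

A low precision here would correspond to the first failure mode named above (tests labeled successful that are not exploitable), and a low recall to the second (feasible attacks that are missed).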

Figures

Figures reproduced from arXiv: 2605.03956 by Danfeng Yao, Na Meng, Shravya Kanchi, Xiaoyan Zang, Ying Zhang.

Figure 1. The threat model of software supply chain attacks
Figure 2. PoVSmith has four phases
Figure 3. A simplified version of the prompt template we…
Figure 4. The template used for iterative PoV test generation
Figure 5. The prompt template we used to assess test quality
Figure 6. One PoV test that PoVSmith successfully generated
Figure 7. One generated test that fails to demonstrate PoV
Figure 8. Distribution of call paths by length. […] have length 1, meaning that the identified methods directly call vulnerable APIs; the other 76 paths have longer lengths (i.e., 2–5), showing how the identified methods indirectly call APIs. These 216 paths correspond to 158 unique source methods, as some paths share the same source and sink methods. For the nine call paths that were incorrectly identified, we observed…
Original abstract

Developers create modern software applications (Apps) on top of third-party libraries (Libs). When library vulnerabilities are reachable through application code, the applications can be vulnerable to software supply chain attacks. Prior work shows that developers often require concrete and executable evidence, i.e., proof-of-vulnerability (PoV) tests, to decide whether a reported dependency vulnerability poses a practical security risk to their application. However, manually crafting such tests is challenging, and existing tool support is insufficient to automate the procedure. To streamline test generation, we created PoVSmith, a new approach that combines call path analysis, exemplar test, code context, and feedback into multiple prompts to guide a coding agent (i.e., Codex) and a large language model (i.e., GPT) for test generation, execution, and assessment. We evaluated PoVSmith on 33 ⟨App, Lib⟩ Java program pairs, where each App depends on a vulnerable Lib. PoVSmith revealed 158 unique application-level entry points (i.e., public methods) calling vulnerable library APIs; 152 (96%) of them were correctly found, together with the call paths properly recognized. With such method call information, PoVSmith generated 152 tests, 84 (55%) of which demonstrated feasible ways of attacking Apps by exploiting Lib vulnerabilities. PoVSmith substantially outperforms the state-of-the-art LLM-based approach, as it reduces human involvement while dramatically improving test quality. Our work contributes (1) a novel approach of agent-based test generation, (2) an iterative code refinement process driven by execution feedback, and (3) LLM-based quality assessment grounded in both the test context and execution logs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents PoVSmith, an LLM-agent approach that combines call-path analysis, exemplar tests, code context, and execution feedback to generate and assess proof-of-vulnerability (PoV) tests for Java applications that depend on vulnerable libraries. On 33 App-Lib pairs, it reports 158 application-level entry points, 152 (96%) of them correctly identified, and 152 generated tests, of which 84 (55%) are assessed by an LLM as demonstrating feasible attacks on the applications.

Significance. If the LLM-based feasibility judgments prove reliable, the work could meaningfully lower the barrier for developers to obtain concrete evidence of supply-chain risk, complementing existing static analysis and fuzzing tools. The agent-driven iterative refinement loop and grounding of assessment in both context and logs represent a practical advance over prior LLM-only baselines for security test generation.

major comments (2)
  1. [Evaluation section] Evaluation section (results on 33 pairs and the 55% figure): the claim that 84 tests 'demonstrated feasible ways of attacking Apps by exploiting Lib vulnerabilities' rests entirely on an LLM assessor that receives only the provided call-path context plus execution logs. No human oracle, CVE-specific exploit oracle, differential comparison against manually written PoVs, or other independent validation is described to calibrate the false-positive rate of this assessor. Because every downstream claim (outperformance vs. prior LLM baselines, reduction in human effort, practical effectiveness) is computed from the same LLM-labeled count, this is load-bearing for the central contribution.
  2. [Abstract and Approach section] Abstract and Approach section: the description of the GPT-based quality assessor provides no details on the exact prompt template, decision criteria for labeling a test 'feasible,' handling of LLM variability (e.g., temperature, multiple runs), or inter-rater agreement with any external ground truth. This omission prevents readers from assessing the reproducibility and soundness of the 55% success rate.
minor comments (2)
  1. [Abstract] The abstract states '152 (96%) of them were correctly found' but does not clarify whether the 6% error rate was measured against a manually verified ground truth or another automated method; adding this detail would strengthen the entry-point accuracy claim.
  2. [Evaluation section] Table or figure presenting the 33 program pairs should include basic statistics (e.g., lines of code, number of vulnerable APIs per pair) to allow readers to judge the diversity and representativeness of the benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of PoVSmith to lower barriers for assessing supply-chain risks. We address each major comment below with clarifications and proposed revisions to improve the manuscript's rigor and reproducibility.

  1. Referee: [Evaluation section] Evaluation section (results on 33 pairs and the 55% figure): the claim that 84 tests 'demonstrated feasible ways of attacking Apps by exploiting Lib vulnerabilities' rests entirely on an LLM assessor that receives only the provided call-path context plus execution logs. No human oracle, CVE-specific exploit oracle, differential comparison against manually written PoVs, or other independent validation is described to calibrate the false-positive rate of this assessor. Because every downstream claim (outperformance vs. prior LLM baselines, reduction in human effort, practical effectiveness) is computed from the same LLM-labeled count, this is load-bearing for the central contribution.

    Authors: We acknowledge that the 55% feasibility rate is determined solely by the LLM assessor without an independent human oracle, CVE-specific exploit validation, or direct comparison to manually crafted PoVs. This choice supports scalability and aligns with our aim to reduce human involvement in PoV generation. The assessor receives call-path context, test code, and execution logs to ground judgments in observable behavior. To address the concern, we will revise the Evaluation section to explicitly discuss this as a limitation, add a small-scale manual calibration (reviewing a random subset of 20 tests for agreement with the LLM labels), and qualify the outperformance claims relative to baselines that use comparable automated assessment. This provides partial calibration of reliability without requiring a full re-evaluation of all 152 tests. revision: partial

  2. Referee: [Abstract and Approach section] Abstract and Approach section: the description of the GPT-based quality assessor provides no details on the exact prompt template, decision criteria for labeling a test 'feasible,' handling of LLM variability (e.g., temperature, multiple runs), or inter-rater agreement with any external ground truth. This omission prevents readers from assessing the reproducibility and soundness of the 55% success rate.

    Authors: We agree that the current description lacks sufficient detail on the assessor for full reproducibility. In the revised manuscript, we will expand the Approach section to include the complete prompt template, the precise decision criteria (e.g., positive label if logs indicate successful vulnerability trigger such as exception patterns or data exfiltration), our use of fixed low temperature (0.0) and single-run execution per test to minimize variability, and an explicit statement that inter-rater agreement with external ground truth was not computed. We will also add this as a noted limitation with suggestions for future work. revision: yes
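
The manual calibration proposed in the first response could be set up along the following lines. This is a minimal Python sketch under stated assumptions: test IDs run from 1 to 152, the seed is arbitrary, the labels are made up, and Cohen's kappa is our choice of agreement measure, not one the authors name.

```python
import random


def draw_calibration_sample(test_ids, k=20, seed=0):
    """Draw a reproducible random subset of tests for manual review."""
    rng = random.Random(seed)
    return sorted(rng.sample(test_ids, k))


def cohen_kappa(a, b):
    """Chance-corrected agreement between two lists of boolean labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share of positives from rater a
    p_b = sum(b) / n  # share of positives from rater b
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        return 1.0  # both raters gave one identical constant label
    return (observed - expected) / (1 - expected)


sample = draw_calibration_sample(list(range(1, 153)))

# Made-up labels for the 20 sampled tests (True = feasible attack).
llm = [True] * 11 + [False] * 9
human = [True] * 9 + [False] * 11

kappa = cohen_kappa(llm, human)
print(len(sample), round(kappa, 3))
```

Fixing the seed makes the reviewed subset reproducible, and a chance-corrected measure guards against agreement that arises only because most labels fall in one class.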

Circularity Check

0 steps flagged

No significant circularity in derivation or evaluation chain


The paper describes an LLM-guided test generation pipeline (call-path analysis + prompting + execution feedback + LLM assessment) and reports direct counts on an external set of 33 App-Lib pairs. The 152 tests and 84/152 success figure are produced by applying the described procedure to those pairs; success is measured by the LLM assessor using provided context and logs, but this is an explicit component of the method rather than a self-definitional loop or fitted parameter renamed as prediction. No equations, self-citations, or uniqueness theorems are invoked to force the result. The evaluation therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the unproven assumption that LLMs can be reliably steered for security test generation; the paper introduces the PoVSmith system as its main contribution.

axioms (1)
  • domain assumption: Large language models can produce correct, executable security tests when supplied with call paths, code context, and execution feedback.
    This capability is invoked as the foundation for the entire test generation and refinement process.
invented entities (1)
  • PoVSmith (no independent evidence)
    purpose: Agent-based system for automated PoV test generation
    The paper proposes this new tool without external independent validation beyond the internal evaluation.

pith-pipeline@v0.9.0 · 5617 in / 1276 out tokens · 74430 ms · 2026-05-07T15:09:48.550877+00:00 · methodology

