pith. machine review for the scientific record.

arxiv: 2604.07624 · v1 · submitted 2026-04-08 · 💻 cs.SE

Recognition: unknown

Program Analysis Guided LLM Agent for Proof-of-Concept Generation


Pith reviewed 2026-05-10 17:00 UTC · model grok-4.3

classification 💻 cs.SE
keywords PoC generation · LLM agent · program analysis · static analysis · sanitizer coverage · vulnerability reproduction · hybrid agent · software security

The pith

A hybrid system that feeds lightweight static analysis and sanitizer coverage to an LLM agent raises proof-of-concept generation success by 132 percent over prior agentic methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PAGENT, an approach that supplies an LLM-based agent with guidance from rule-based static analysis phases and from sanitizer-based profiling and coverage data, so that the agent can generate inputs that reliably trigger reported vulnerabilities. Pure LLM agents have low success rates on this task, while symbolic execution and fuzzing require expert steps and do not scale well; the hybrid method aims to overcome both limits by letting the LLM exploit accurate program signals without manual intervention. If the claim holds, developers receiving vulnerability reports could obtain working PoC inputs automatically and at higher rates than current automated techniques allow. Experiments on the described benchmarks show the hybrid system outperforming the previous best agentic baseline by 132 percent.
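The interleaving described above can be sketched as a simple loop; all function names and data shapes below are illustrative placeholders, not PAGENT's actual interfaces:

```python
# Minimal sketch of a static-plus-dynamic guidance loop for PoC generation.
# All function names and data shapes are hypothetical placeholders.

def generate_poc(static_guidance, run_with_sanitizer, llm_propose, budget=5):
    """Iterate: ask the LLM for a candidate input, execute it under a
    sanitizer, and feed the resulting coverage/crash signal back."""
    feedback = None
    for _ in range(budget):
        candidate = llm_propose(static_guidance, feedback)
        report = run_with_sanitizer(candidate)   # dynamic guidance signal
        if report["crashed"]:                    # sanitizer confirms the trigger
            return candidate
        feedback = report["coverage"]            # steer the next attempt
    return None
```

The key design point, as the review reads it, is that the LLM never has to reason blindly: each round is conditioned on concrete program signals rather than on the vulnerability report alone.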

Core claim

The paper claims that PAGENT, by interleaving lightweight static analysis guidance and sanitizer-derived dynamic coverage information with an LLM PoC generation agent, produces a scalable hybrid method whose success rate on automated vulnerability reproduction is 132 percent higher than the strongest prior agentic approach.

What carries the argument

PAGENT, the agent that receives static-analysis rules and sanitizer coverage signals to steer its generation of trigger inputs for a given vulnerability location.
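As one hypothetical rendering of these signals, the static-analysis fields shown in Figure 1 (vulnerability type, vulnerable function, entrypoint, taint path, program location) could be serialized into a prompt fragment like this; the formatting function is a sketch, not the paper's code:

```python
# Illustrative only: packaging Figure 1's static-analysis fields into a
# prompt fragment for the agent. Field names follow the figure; the
# function itself is a hypothetical sketch.

def format_static_guidance(g: dict) -> str:
    lines = [
        f"Vulnerability type: {g['vulnerability_type']}",
        f"Vulnerable function: {g['vulnerable_function']} (line {g['location']})",
        f"Fuzzing entrypoint: {g['entrypoint']}",
        "Taint path: " + " -> ".join(g["taint_path"]),
    ]
    return "\n".join(lines)
```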

If this is right

  • Developers can obtain concrete trigger inputs for reported vulnerabilities with far less manual reproduction effort.
  • Security teams can respond to vulnerability disclosures at larger scale without relying on expert-guided symbolic or fuzzing pipelines.
  • The same guidance pattern could reduce the expert steps currently needed when symbolic execution or fuzzing is applied to PoC tasks.
  • Automated reproduction becomes feasible for projects whose size previously made manual or expert-assisted methods impractical.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar static-plus-sanitizer guidance layers could be tested on LLM agents for adjacent tasks such as generating regression tests or localizing bugs.
  • If the guidance remains effective when the underlying sanitizer or static checker changes, the approach may transfer across different language runtimes.
  • The performance gain might shrink on vulnerabilities whose triggering conditions lie outside the coverage signals the sanitizers currently report.

Load-bearing premise

The lightweight static analysis phases and the sanitizer coverage data must deliver accurate, non-misleading signals that the LLM agent can reliably exploit.
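The "sanitizer confirms the trigger" signal this premise rests on can be stated concretely. A minimal version, assuming an AddressSanitizer-instrumented binary and ASan's standard report conventions (the wrapper itself is illustrative, not PAGENT's code), might look like:

```python
# Sketch: did a candidate input trigger the sanitizer? Assumes an
# ASan-instrumented binary; the wrapper is illustrative, not PAGENT's code.
import subprocess

def triggers_sanitizer(binary: str, poc_path: str, timeout: int = 10) -> bool:
    proc = subprocess.run(
        [binary, poc_path],
        capture_output=True, text=True, timeout=timeout,
    )
    # ASan writes its report to stderr and exits nonzero on a detected bug.
    return proc.returncode != 0 and "ERROR: AddressSanitizer" in proc.stderr
```

If this check is noisy (flaky crashes, reports at the wrong location), the agent is being steered by a misleading oracle, which is exactly the failure mode the premise rules out.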

What would settle it

On a held-out collection of vulnerabilities, measure whether PAGENT's PoC success rate exceeds the prior top agentic baseline by a substantial margin; failure to do so would undercut the central claim.
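For reference, the 132 percent figure is a relative improvement of success rates. With made-up counts (not the paper's numbers), the arithmetic is:

```python
# Relative improvement of one success rate over another, as in a
# "132% higher" claim. The counts below are invented for illustration.

def relative_improvement(new_successes: int, base_successes: int, total: int) -> float:
    base = base_successes / total
    new = new_successes / total
    return (new - base) / base * 100.0

# e.g., 58 vs 25 successes out of 100 benchmarks is a 132% relative gain
```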

Figures

Figures reproduced from arXiv: 2604.07624 by Achintya Desai, Md Shafiuzzaman, Tevfik Bultan, Wenbo Guo.

Figure 1. Overview of the PAGENT technique, with an example of static analysis guidance: vulnerability type Stack-Buffer-Overflow, vulnerable function get_register_operand at program location 204, entrypoint LLVMFuzzerTestOneInput, taint path LLVMFuzzerTestOneInput → print_insn_tic30 → print_branch → get_register_operand, and a template assertion violation.
Figure 2. Static and dynamic analysis guidance example for ARVO:18615.
Figure 3. Overview of the Static Analysis Guidance component.
Figure 4. Overview of the PoC Generation Agent (Algorithm 2). Input: source code S, vulnerability report VR, code location CL, iteration budget B; output: PoC P. Phase 0 prepares the task guidance input from a prompt, a task-instruction template, the vulnerability entry fetched from the report, and a generated README.
Figure 5. Prompt and task-specific instruction template.
Figure 6. Overview of the Dynamic Analysis Guidance component.
Figure 7. PoC success rates (%) versus agent guidance levels within PAGENT.
Figure 8. Comparison of PoC success counts. (a) Exclusive PoCs: Claude4 7, GPT5 12, DS3.2 18, Claude4+GPT5 9, Claude4+DS3.2 7, DS3.2+GPT5 3, PAGENT (DS3.1) 23, PAGENT (DS3.2) 40. (b) Post-patch vulnerabilities: Claude4 7, GPT5 8, DS3.1 7, DS3.2 8, PAGENT (DS3.1) 20, PAGENT (DS3.2) 32.
Figure 9. Comparisons between exclusive PoCs and post-patch vulnerabilities across agents.
original abstract

Software developers frequently receive vulnerability reports that require them to reproduce the vulnerability in a reliable manner by generating a proof-of-concept (PoC) input that triggers it. Given the source code for a software project and a specific code location for a potential vulnerability, automatically generating a PoC for the given vulnerability has been a challenging research problem. Symbolic execution and fuzzing techniques require expert guidance and manual steps and face scalability challenges for PoC generation. Although recent advances in LLMs have increased the level of automation and scalability, the success rate of PoC generation with LLMs remains quite low. In this paper, we present a novel approach called Program Analysis Guided proof of concept generation agENT (PAGENT) that is scalable and significantly improves the success rate of automated PoC generation compared to prior results. PAGENT integrates lightweight and rule-based static analysis phases for providing static analysis guidance and sanitizer-based profiling and coverage information for providing dynamic analysis guidance with a PoC generation agent. Our experiments demonstrate that the resulting hybrid approach significantly outperforms the prior top-performing agentic approach by 132% for the PoC generation task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PAGENT, a hybrid system that combines lightweight rule-based static analysis phases and sanitizer-based dynamic profiling/coverage information to guide an LLM-based agent in automatically generating proof-of-concept (PoC) inputs that trigger reported vulnerabilities, given source code and a target location. It claims this approach is more scalable than symbolic execution or fuzzing and significantly outperforms the prior top agentic baseline by 132% on the PoC generation task.

Significance. If the empirical results and the reliability of the guidance signals can be substantiated, the work would represent a meaningful step toward more automated and scalable vulnerability reproduction, potentially reducing the manual effort required for security triage. The core idea of using program analysis outputs as non-misleading signals to steer LLM agents is a natural and timely direction at the intersection of static/dynamic analysis and agentic LLMs.

major comments (2)
  1. [Abstract] Abstract: the claim that the hybrid approach 'significantly outperforms the prior top-performing agentic approach by 132%' supplies no information on the number of benchmarks, the exact baseline agent, the definition of success for PoC generation, or statistical significance. Without these details the central empirical claim cannot be evaluated.
  2. [Evaluation] Evaluation section (and associated tables/figures): no quantitative evidence is provided on the accuracy of the lightweight static-analysis warnings or the sanitizer-derived coverage signals (e.g., precision of warnings, coverage delta attributable to sanitizers, or an ablation that removes the guidance phases). If these signals contain false positives or noise that the LLM agent cannot reliably filter, the reported 132% gain cannot be attributed to the proposed guidance mechanism.
minor comments (1)
  1. [Abstract] The abstract and title use the stylized acronym 'PAGENT' and 'agENT'; ensure consistent capitalization and expansion on first use in the body.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will revise the paper to improve clarity and substantiate the claims where needed.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the hybrid approach 'significantly outperforms the prior top-performing agentic approach by 132%' supplies no information on the number of benchmarks, the exact baseline agent, the definition of success for PoC generation, or statistical significance. Without these details the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract would benefit from additional context to make the central claim self-contained. In the revised manuscript, we will expand the abstract to summarize the number of benchmarks, identify the specific prior top-performing agentic baseline used for comparison, clarify the definition of success for PoC generation (an input that triggers the vulnerability as confirmed by the sanitizer), and note the statistical significance of the 132% relative improvement. These details are already elaborated in the Evaluation section and will be condensed into the abstract for better accessibility. revision: yes

  2. Referee: [Evaluation] Evaluation section (and associated tables/figures): no quantitative evidence is provided on the accuracy of the lightweight static-analysis warnings or the sanitizer-derived coverage signals (e.g., precision of warnings, coverage delta attributable to sanitizers, or an ablation that removes the guidance phases). If these signals contain false positives or noise that the LLM agent cannot reliably filter, the reported 132% gain cannot be attributed to the proposed guidance mechanism.

    Authors: We acknowledge that the current evaluation focuses on end-to-end PoC generation success and does not include dedicated metrics or ablations for the individual guidance signals. To address this, we will add an ablation study in the revised manuscript that removes the static analysis guidance and the dynamic sanitizer-based guidance phases separately, quantifying their individual contributions to the overall success rate. We will also report the precision of the lightweight static-analysis warnings on the benchmark set and the coverage improvements attributable to the sanitizer profiling. This will help demonstrate that the LLM agent effectively leverages these signals and that the performance gains can be attributed to the hybrid guidance approach. revision: yes
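The promised ablation has a simple shape: rerun the agent with each guidance phase toggled off and tabulate success rates. The sketch below uses a hypothetical `run_agent` stub and is not the authors' harness:

```python
# Hypothetical ablation harness for the guidance phases. `run_agent` is a
# stub standing in for a full agent run; it returns True on PoC success.

def ablate(run_agent, benchmarks):
    configs = [
        {"static": True,  "dynamic": True},   # full hybrid guidance
        {"static": True,  "dynamic": False},  # static guidance only
        {"static": False, "dynamic": True},   # dynamic guidance only
        {"static": False, "dynamic": False},  # unguided agent baseline
    ]
    rates = {}
    for cfg in configs:
        wins = sum(1 for b in benchmarks if run_agent(b, **cfg))
        key = (("static" if cfg["static"] else "-") + "/"
               + ("dynamic" if cfg["dynamic"] else "-"))
        rates[key] = wins / len(benchmarks)
    return rates
```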

Circularity Check

0 steps flagged

No circularity: empirical outperformance claim is externally measured

full rationale

The paper presents PAGENT as an integration of lightweight static analysis, sanitizer-based dynamic guidance, and an LLM agent for PoC generation. Its central claim is an experimental result (132% outperformance over a prior agentic baseline). No equations, derivations, fitted parameters, or predictions appear in the provided text. The success rate is reported from head-to-head experiments rather than being forced by definition, self-citation chains, or renaming of known patterns. The approach description does not reduce to its own inputs by construction; the measured improvement is independent evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, domain-specific axioms, or invented entities are introduced or required by the abstract description.

pith-pipeline@v0.9.0 · 5500 in / 1009 out tokens · 50468 ms · 2026-05-10T17:00:18.896732+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 14 canonical work pages · 3 internal anchors

  1. [1] T. Abramovich, M. Udeshi, M. Shao, K. Lieret, H. Xi, K. Milner, S. Jancheska, J. Yang, C. E. Jimenez, F. Khorrami, P. Krishnamurthy, B. Dolan-Gavitt, M. Shafique, K. Narasimhan, R. Karri, and O. Press. EnIGMA: Enhanced interactive generative model agent for CTF challenges. Technical report.
  2. [2] T. Ahmed, J. Ganhotra, R. Pan, A. Shinnar, S. Sinha, and M. Hirzel. Otter: Generating tests from issues to validate SWE patches. arXiv preprint arXiv:2502.05368, 2025.
  3. [3] H. Aslanyan, H. Movsisyan, H. Hovhannisyan, Z. Gevorgyan, R. Mkoyan, A. Avetisyan, and S. Sargsyan. Combining static analysis with directed symbolic execution for scalable and accurate memory leak detection. IEEE Access, 12:80128–80137, 2024.
  4. [4] R. Baldoni, E. Coppa, D. C. D'Elia, C. Demetrescu, and I. Finocchi. A survey of symbolic execution techniques. ACM Computing Surveys (CSUR), 51(3):1–39, 2018.
  5. [5] L. Barrett and S. Moore. cclyzer++: Scalable and precise pointer analysis for LLVM. https://galois.com/blog/2022/08/cclyzer-scalable-and-precise-pointer-analysis-for-llvm/, 2022.
  6. [6] M. Böhme, C. Cadar, and A. Roychoudhury. Fuzzing: Challenges and reflections. IEEE Software, 38(3):79–86, 2020.
  7. [7] I. Bouzenia, P. Devanbu, and M. Pradel. RepairAgent: An autonomous, LLM-based agent for program repair. arXiv preprint arXiv:2403.17134, 2024.
  8. [8] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler. EXE: Automatically generating inputs of death. ACM Transactions on Information and System Security (TISSEC), 12(2):1–38, 2008.
  9. [9] C. Cadar and K. Sen. Symbolic execution for software testing: three decades later. Communications of the ACM, 56(2):82–90, 2013.
  10. [10] B. Cheng, K. Wang, L. Shi, H. Wang, Y. Guo, D. Li, and X. Chen. Enhancing semantic understanding in pointer analysis using large language models. In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Languages, pages 112–117, New York, NY, USA, Oct. 2025. ACM.
  11. [11] CVE. CVE metrics. https://www.cve.org/about/Metrics, 2026.
  12. [12] S. Feng and C. Chen. Prompting is all you need: Automated Android bug replay with large language models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pages 1–13, 2024.
  13. [13] P. Godefroid, M. Y. Levin, and D. Molnar. SAGE: Whitebox fuzzing for security testing. Communications of the ACM, 55(3):40–44, 2012.
  14. [14] Google. OSS-Fuzz: Continuous fuzzing for open source software. https://google.github.io/oss-fuzz/, 2016. Accessed: 2026-01.
  15. [15] K. Hassler, P. Görz, S. Lipp, T. Holz, and M. Böhme. A comparative study of fuzzers and static analysis tools for finding memory unsafety in C and C++. arXiv preprint arXiv:2505.22052, 2025.
  16. [16] S. Jain, A. Dora, K. S. Sam, and P. Singh. LLM agents improve semantic code search. arXiv preprint arXiv:2408.11058, 2024.
  17. [17] H. Jordan, B. Scholz, and P. Subotić. Soufflé: On synthesis of program analyzers. In Computer Aided Verification: 28th International Conference, CAV 2016, Toronto, ON, Canada, July 17–23, 2016, Proceedings, Part II, pages 422–430. Springer, 2016.
  18. [18] S. Kang, J. Yoon, N. Askarbekkyzy, and S. Yoo. Evaluating diverse large language models for automatic and general bug reproduction. IEEE Transactions on Software Engineering, 2024.
  19. [19] G. Li, M. Sridharan, and Z. Qian. Redefining indirect call analysis with KallGraph. In 2025 IEEE Symposium on Security and Privacy (SP), pages 2957–2975. IEEE, 2025.
  20. [20] L. Li, T. F. Bissyandé, M. Papadakis, S. Rasthofer, A. Bartel, D. Octeau, J. Klein, and Y. Le Traon. Static analysis of Android apps: A systematic literature review. Information and Software Technology, 88:67–95, 2017.
  21. [21] X. Lin, Y. Ning, J. Zhang, Y. Dong, Y. Liu, Y. Wu, X. Qi, N. Sun, Y. Shang, K. Wang, et al. LLM-based agents suffer from hallucinations: A survey of taxonomy, methods, and directions. arXiv preprint arXiv:2509.18970, 2025.
  22. [22] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
  23. [23] A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
  24. [24] J. Liu, C. S. Xia, Y. Wang, and L. Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36:21558–21572, 2023.
  25. [25] K. Lu. Practical program modularization with type-based dependence analysis. In 2023 IEEE Symposium on Security and Privacy (SP), pages 1256–1270. IEEE, 2023.
  26. [26] K. Lu and H. Hu. Where does it go? Refining indirect-call targets with multi-layer type analysis. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 1867–1881, 2019.
  27. [27] W. Lv, X. Xia, and S.-J. Huang. CodeACT: Code adaptive compute-efficient tuning framework for code LLMs. arXiv preprint arXiv:2408.02193, 2024.
  28. [28] X. Mei, P. S. Singaria, J. Del Castillo, H. Xi, T. Bao, R. Wang, Y. Shoshitaishvili, A. Doupé, H. Pearce, B. Dolan-Gavitt, et al. ARVO: Atlas of reproducible vulnerabilities for open source software. arXiv preprint arXiv:2408.02153, 2024.
  29. [29] V. Nitin, B. Ray, and R. Z. Moghaddam. FaultLine: Automated proof-of-vulnerability generation using LLM agents. arXiv preprint arXiv:2507.15241, 2025.
  30. [30] OpenAI. OpenAI Codex CLI: Lightweight coding agent for the terminal. https://github.com/openai/codex, 2025. Accessed: 2025-05-10.
  31. [31] C. S. Păsăreanu and N. Rungta. Symbolic PathFinder: Symbolic execution of Java bytecode. In Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering, pages 179–180, 2010.
  32. [32] D. Pomian, A. Bellur, M. Dilhara, Z. Kurbatova, E. Bogomolov, T. Bryksin, and D. Dig. Next-generation refactoring: Combining LLM insights and IDE capabilities for extract method. In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 275–287. IEEE, 2024.
  33. [33] G. Ryan, S. Jain, M. Shang, S. Wang, X. Ma, M. K. Ramanathan, and B. Ray. Code-aware prompting: A study of coverage-guided test generation in regression setting using LLM. Proceedings of the ACM on Software Engineering, 1(FSE):951–971, 2024.
  34. [34] S. Saha, L. Sarker, M. Shafiuzzaman, C. Shou, A. Li, G. Sankaran, and T. Bultan. Rare path guided fuzzing. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1295–1306, 2023.
  35. [35] K. Sen, D. Marinov, and G. Agha. CUTE: A concolic unit testing engine for C. ACM SIGSOFT Software Engineering Notes, 30(5):263–272, 2005.
  36. [36] M. Shafiuzzaman, A. Desai, L. Sarker, and T. Bultan. STASE: Static analysis guided symbolic execution for UEFI vulnerability signature generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 1783–1794, 2024.
  37. [37] B. Shastry, M. Leutner, T. Fiebig, K. Thimmaraju, F. Yamaguchi, K. Rieck, S. Schmid, J.-P. Seifert, and A. Feldmann. Static program analysis as a fuzzing aid. In International Symposium on Research in Attacks, Intrusions, and Defenses, pages 26–47. Springer, 2017.
  38. [38] D. Simsek, A. Eghbali, and M. Pradel. PoCGen: Generating proof-of-concept exploits for vulnerabilities in npm packages. arXiv preprint arXiv:2506.04962, 2025.
  39. [39] X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
  40. [40] Z. Wang, T. Shi, J. He, M. Cai, J. Zhang, and D. Song. CyberGym: Evaluating AI agents' cybersecurity capabilities with real-world vulnerabilities at scale. arXiv preprint arXiv:2506.02548, 2025.
  41. [41] V. Wüstholz and M. Christakis. Targeted greybox fuzzing with static lookahead analysis. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pages 789–800, 2020.
  42. [42] C. S. Xia, M. Paltenghi, J. Le Tian, M. Pradel, and L. Zhang. Fuzz4All: Universal fuzzing with large language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.
  43. [43] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.
  44. [44] W. Zhang, Y. Sun, P. Huang, J. Pu, H. Lin, and D. Song. MIRAGE-Bench: LLM agent is hallucinating and where to find them. arXiv preprint arXiv:2507.21017, 2025.
  45. [45] Y. Zheng, Z. Song, Y. Sun, K. Cheng, H. Zhu, and L. Sun. An efficient greybox fuzzing scheme for Linux-based IoT programs through binary static analysis. In 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC), pages 1–8. IEEE, 2019.