pith. machine review for the scientific record.

arxiv: 2604.07624 · v1 · submitted 2026-04-08 · 💻 cs.SE

Recognition: unknown

Program Analysis Guided LLM Agent for Proof-of-Concept Generation


Pith reviewed 2026-05-10 17:00 UTC · model grok-4.3

classification 💻 cs.SE
keywords PoC generation · LLM agent · program analysis · static analysis · sanitizer coverage · vulnerability reproduction · hybrid agent · software security

The pith

A hybrid system that feeds lightweight static analysis and sanitizer coverage to an LLM agent raises proof-of-concept generation success by 132 percent over prior agentic methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PAGENT, an approach that supplies an LLM-based agent with guidance from rule-based static analysis phases and from sanitizer-based profiling and coverage data, so that the agent can generate inputs that reliably trigger reported vulnerabilities. Pure LLM agents have low success rates on this task, while symbolic execution and fuzzing require expert steps and do not scale well; the hybrid method aims to overcome both limits by letting the LLM exploit accurate program signals without manual intervention. If the claim holds, developers receiving vulnerability reports could obtain working PoC inputs automatically and at higher rates than current automated techniques allow. Experiments on the described benchmarks show the hybrid system outperforming the previous best agentic baseline by 132 percent.
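The interleaving described above can be sketched as a simple loop; all function names and data shapes below are illustrative placeholders, not PAGENT's actual interfaces:

```python
# Minimal sketch of a static-plus-dynamic guidance loop for PoC generation.
# All function names and data shapes are hypothetical placeholders.

def generate_poc(static_guidance, run_with_sanitizer, llm_propose, budget=5):
    """Iterate: ask the LLM for a candidate input, execute it under a
    sanitizer, and feed the resulting coverage/crash signal back."""
    feedback = None
    for _ in range(budget):
        candidate = llm_propose(static_guidance, feedback)
        report = run_with_sanitizer(candidate)   # dynamic guidance signal
        if report["crashed"]:                    # sanitizer confirms the trigger
            return candidate
        feedback = report["coverage"]            # steer the next attempt
    return None
```

The key design point, as the review reads it, is that the LLM never has to reason blindly: each round is conditioned on concrete program signals rather than on the vulnerability report alone.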

Core claim

The paper claims that PAGENT, by interleaving lightweight static analysis guidance and sanitizer-derived dynamic coverage information with an LLM PoC generation agent, produces a scalable hybrid method whose success rate on automated vulnerability reproduction is 132 percent higher than the strongest prior agentic approach.

What carries the argument

PAGENT, the agent that receives static-analysis rules and sanitizer coverage signals to steer its generation of trigger inputs for a given vulnerability location.
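As one hypothetical rendering of these signals, the static-analysis fields shown in Figure 1 (vulnerability type, vulnerable function, entrypoint, taint path, program location) could be serialized into a prompt fragment like this; the formatting function is a sketch, not the paper's code:

```python
# Illustrative only: packaging Figure 1's static-analysis fields into a
# prompt fragment for the agent. Field names follow the figure; the
# function itself is a hypothetical sketch.

def format_static_guidance(g: dict) -> str:
    lines = [
        f"Vulnerability type: {g['vulnerability_type']}",
        f"Vulnerable function: {g['vulnerable_function']} (line {g['location']})",
        f"Fuzzing entrypoint: {g['entrypoint']}",
        "Taint path: " + " -> ".join(g["taint_path"]),
    ]
    return "\n".join(lines)
```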

If this is right

  • Developers can obtain concrete trigger inputs for reported vulnerabilities with far less manual reproduction effort.
  • Security teams can respond to vulnerability disclosures at larger scale without relying on expert-guided symbolic or fuzzing pipelines.
  • The same guidance pattern could reduce the expert steps currently needed when symbolic execution or fuzzing is applied to PoC tasks.
  • Automated reproduction becomes feasible for projects whose size previously made manual or expert-assisted methods impractical.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar static-plus-sanitizer guidance layers could be tested on LLM agents for adjacent tasks such as generating regression tests or localizing bugs.
  • If the guidance remains effective when the underlying sanitizer or static checker changes, the approach may transfer across different language runtimes.
  • The performance gain might shrink on vulnerabilities whose triggering conditions lie outside the coverage signals the sanitizers currently report.

Load-bearing premise

The lightweight static analysis phases and the sanitizer coverage data must deliver accurate, non-misleading signals that the LLM agent can reliably exploit.
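The "sanitizer confirms the trigger" signal this premise rests on can be stated concretely. A minimal version, assuming an AddressSanitizer-instrumented binary and ASan's standard report conventions (the wrapper itself is illustrative, not PAGENT's code), might look like:

```python
# Sketch: did a candidate input trigger the sanitizer? Assumes an
# ASan-instrumented binary; the wrapper is illustrative, not PAGENT's code.
import subprocess

def triggers_sanitizer(binary: str, poc_path: str, timeout: int = 10) -> bool:
    proc = subprocess.run(
        [binary, poc_path],
        capture_output=True, text=True, timeout=timeout,
    )
    # ASan writes its report to stderr and exits nonzero on a detected bug.
    return proc.returncode != 0 and "ERROR: AddressSanitizer" in proc.stderr
```

If this check is noisy (flaky crashes, reports at the wrong location), the agent is being steered by a misleading oracle, which is exactly the failure mode the premise rules out.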

What would settle it

On a held-out collection of vulnerabilities, measure whether PAGENT's PoC success rate exceeds the prior top agentic baseline by a substantial margin; failure to do so would undercut the central claim.
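For reference, the 132 percent figure is a relative improvement of success rates. With made-up counts (not the paper's numbers), the arithmetic is:

```python
# Relative improvement of one success rate over another, as in a
# "132% higher" claim. The counts below are invented for illustration.

def relative_improvement(new_successes: int, base_successes: int, total: int) -> float:
    base = base_successes / total
    new = new_successes / total
    return (new - base) / base * 100.0

# e.g., 58 vs 25 successes out of 100 benchmarks is a 132% relative gain
```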

Figures

Figures reproduced from arXiv: 2604.07624 by Achintya Desai, Md Shafiuzzaman, Tevfik Bultan, Wenbo Guo.

Figure 1. Overview of the PAGENT technique, with an example of static analysis guidance: vulnerability type Stack-Buffer-Overflow, vulnerable function get_register_operand at program location 204, entrypoint LLVMFuzzerTestOneInput, taint path LLVMFuzzerTestOneInput → print_insn_tic30 → print_branch → get_register_operand, and a template assertion violation.
Figure 2. Static and dynamic analysis guidance example for ARVO:18615.
Figure 3. Overview of the Static Analysis Guidance component.
Figure 4. Overview of the PoC Generation Agent (Algorithm 2). Input: source code S, vulnerability report VR, code location CL, iteration budget B; output: PoC P. Phase 0 prepares the task guidance input from a prompt, a task-instruction template, the vulnerability entry fetched from the report, and a generated README.
Figure 5. Prompt and task-specific instruction template.
Figure 6. Overview of the Dynamic Analysis Guidance component.
Figure 7. PoC success rates (%) versus agent guidance levels within PAGENT.
Figure 8. Comparison of PoC success counts. (a) Exclusive PoCs: Claude4 7, GPT5 12, DS3.2 18, Claude4+GPT5 9, Claude4+DS3.2 7, DS3.2+GPT5 3, PAGENT (DS3.1) 23, PAGENT (DS3.2) 40. (b) Post-patch vulnerabilities: Claude4 7, GPT5 8, DS3.1 7, DS3.2 8, PAGENT (DS3.1) 20, PAGENT (DS3.2) 32.
Figure 9. Comparisons between exclusive PoCs and post-patch vulnerabilities across agents.
original abstract

Software developers frequently receive vulnerability reports that require them to reproduce the vulnerability in a reliable manner by generating a proof-of-concept (PoC) input that triggers it. Given the source code for a software project and a specific code location for a potential vulnerability, automatically generating a PoC for the given vulnerability has been a challenging research problem. Symbolic execution and fuzzing techniques require expert guidance and manual steps and face scalability challenges for PoC generation. Although recent advances in LLMs have increased the level of automation and scalability, the success rate of PoC generation with LLMs remains quite low. In this paper, we present a novel approach called Program Analysis Guided proof of concept generation agENT (PAGENT) that is scalable and significantly improves the success rate of automated PoC generation compared to prior results. PAGENT integrates lightweight and rule-based static analysis phases for providing static analysis guidance and sanitizer-based profiling and coverage information for providing dynamic analysis guidance with a PoC generation agent. Our experiments demonstrate that the resulting hybrid approach significantly outperforms the prior top-performing agentic approach by 132% for the PoC generation task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PAGENT, a hybrid system that combines lightweight rule-based static analysis phases and sanitizer-based dynamic profiling/coverage information to guide an LLM-based agent in automatically generating proof-of-concept (PoC) inputs that trigger reported vulnerabilities, given source code and a target location. It claims this approach is more scalable than symbolic execution or fuzzing and significantly outperforms the prior top agentic baseline by 132% on the PoC generation task.

Significance. If the empirical results and the reliability of the guidance signals can be substantiated, the work would represent a meaningful step toward more automated and scalable vulnerability reproduction, potentially reducing the manual effort required for security triage. The core idea of using program analysis outputs as non-misleading signals to steer LLM agents is a natural and timely direction at the intersection of static/dynamic analysis and agentic LLMs.

major comments (2)
  1. [Abstract] Abstract: the claim that the hybrid approach 'significantly outperforms the prior top-performing agentic approach by 132%' supplies no information on the number of benchmarks, the exact baseline agent, the definition of success for PoC generation, or statistical significance. Without these details the central empirical claim cannot be evaluated.
  2. [Evaluation] Evaluation section (and associated tables/figures): no quantitative evidence is provided on the accuracy of the lightweight static-analysis warnings or the sanitizer-derived coverage signals (e.g., precision of warnings, coverage delta attributable to sanitizers, or an ablation that removes the guidance phases). If these signals contain false positives or noise that the LLM agent cannot reliably filter, the reported 132% gain cannot be attributed to the proposed guidance mechanism.
minor comments (1)
  1. [Abstract] The abstract and title use the stylized acronym 'PAGENT' and 'agENT'; ensure consistent capitalization and expansion on first use in the body.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will revise the paper to improve clarity and substantiate the claims where needed.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the hybrid approach 'significantly outperforms the prior top-performing agentic approach by 132%' supplies no information on the number of benchmarks, the exact baseline agent, the definition of success for PoC generation, or statistical significance. Without these details the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract would benefit from additional context to make the central claim self-contained. In the revised manuscript, we will expand the abstract to summarize the number of benchmarks, identify the specific prior top-performing agentic baseline used for comparison, clarify the definition of success for PoC generation (an input that triggers the vulnerability as confirmed by the sanitizer), and note the statistical significance of the 132% relative improvement. These details are already elaborated in the Evaluation section and will be condensed into the abstract for better accessibility. revision: yes

  2. Referee: [Evaluation] Evaluation section (and associated tables/figures): no quantitative evidence is provided on the accuracy of the lightweight static-analysis warnings or the sanitizer-derived coverage signals (e.g., precision of warnings, coverage delta attributable to sanitizers, or an ablation that removes the guidance phases). If these signals contain false positives or noise that the LLM agent cannot reliably filter, the reported 132% gain cannot be attributed to the proposed guidance mechanism.

    Authors: We acknowledge that the current evaluation focuses on end-to-end PoC generation success and does not include dedicated metrics or ablations for the individual guidance signals. To address this, we will add an ablation study in the revised manuscript that removes the static analysis guidance and the dynamic sanitizer-based guidance phases separately, quantifying their individual contributions to the overall success rate. We will also report the precision of the lightweight static-analysis warnings on the benchmark set and the coverage improvements attributable to the sanitizer profiling. This will help demonstrate that the LLM agent effectively leverages these signals and that the performance gains can be attributed to the hybrid guidance approach. revision: yes
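The promised ablation has a simple shape: rerun the agent with each guidance phase toggled off and tabulate success rates. The sketch below uses a hypothetical `run_agent` stub and is not the authors' harness:

```python
# Hypothetical ablation harness for the guidance phases. `run_agent` is a
# stub standing in for a full agent run; it returns True on PoC success.

def ablate(run_agent, benchmarks):
    configs = [
        {"static": True,  "dynamic": True},   # full hybrid guidance
        {"static": True,  "dynamic": False},  # static guidance only
        {"static": False, "dynamic": True},   # dynamic guidance only
        {"static": False, "dynamic": False},  # unguided agent baseline
    ]
    rates = {}
    for cfg in configs:
        wins = sum(1 for b in benchmarks if run_agent(b, **cfg))
        key = (("static" if cfg["static"] else "-") + "/"
               + ("dynamic" if cfg["dynamic"] else "-"))
        rates[key] = wins / len(benchmarks)
    return rates
```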

Circularity Check

0 steps flagged

No circularity: empirical outperformance claim is externally measured

full rationale

The paper presents PAGENT as an integration of lightweight static analysis, sanitizer-based dynamic guidance, and an LLM agent for PoC generation. Its central claim is an experimental result (132% outperformance over a prior agentic baseline). No equations, derivations, fitted parameters, or predictions appear in the provided text. The success rate is reported from head-to-head experiments rather than being forced by definition, self-citation chains, or renaming of known patterns. The approach description does not reduce to its own inputs by construction; the measured improvement is independent evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, domain-specific axioms, or invented entities are introduced or required by the abstract description.

pith-pipeline@v0.9.0 · 5500 in / 1009 out tokens · 50468 ms · 2026-05-10T17:00:18.896732+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 14 canonical work pages · 3 internal anchors

  1. [1] T. Abramovich, M. Udeshi, M. Shao, K. Lieret, H. Xi, K. Milner, S. Jancheska, J. Yang, C. E. Jimenez, F. Khorrami, P. Krishnamurthy, B. Dolan-Gavitt, M. Shafique, K. Narasimhan, R. Karri, and O. Press. EnIGMA: Enhanced interactive generative model agent for CTF challenges. Technical report.
  2. [2] T. Ahmed, J. Ganhotra, R. Pan, A. Shinnar, S. Sinha, and M. Hirzel. Otter: Generating tests from issues to validate SWE patches. arXiv preprint arXiv:2502.05368, 2025.
  3. [3] H. Aslanyan, H. Movsisyan, H. Hovhannisyan, Z. Gevorgyan, R. Mkoyan, A. Avetisyan, and S. Sargsyan. Combining static analysis with directed symbolic execution for scalable and accurate memory leak detection. IEEE Access, 12:80128–80137, 2024.
  4. [4] R. Baldoni, E. Coppa, D. C. D'Elia, C. Demetrescu, and I. Finocchi. A survey of symbolic execution techniques. ACM Computing Surveys (CSUR), 51(3):1–39, 2018.
  5. [5] L. Barrett and S. Moore. cclyzer++: Scalable and precise pointer analysis for LLVM. https://galois.com/blog/2022/08/cclyzer-scalable-and-precise-pointer-analysis-for-llvm/, 2022.
  6. [6] M. Böhme, C. Cadar, and A. Roychoudhury. Fuzzing: Challenges and reflections. IEEE Software, 38(3):79–86, 2020.
  7. [7] I. Bouzenia, P. Devanbu, and M. Pradel. RepairAgent: An autonomous, LLM-based agent for program repair. arXiv preprint arXiv:2403.17134, 2024.
  8. [8] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler. EXE: Automatically generating inputs of death. ACM Transactions on Information and System Security (TISSEC), 12(2):1–38, 2008.
  9. [9] C. Cadar and K. Sen. Symbolic execution for software testing: three decades later. Communications of the ACM, 56(2):82–90, 2013.
  10. [10] B. Cheng, K. Wang, L. Shi, H. Wang, Y. Guo, D. Li, and X. Chen. Enhancing semantic understanding in pointer analysis using large language models. In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Languages, pages 112–117, New York, NY, USA, Oct. 2025. ACM.
  11. [11] CVE. CVE metrics. https://www.cve.org/about/Metrics, 2026.
  12. [12] S. Feng and C. Chen. Prompting is all you need: Automated Android bug replay with large language models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pages 1–13, 2024.
  13. [13] P. Godefroid, M. Y. Levin, and D. Molnar. SAGE: Whitebox fuzzing for security testing. Communications of the ACM, 55(3):40–44, 2012.
  14. [14] Google. OSS-Fuzz: Continuous fuzzing for open source software. https://google.github.io/oss-fuzz/, 2016. Accessed: 2026-01.
  15. [15] K. Hassler, P. Görz, S. Lipp, T. Holz, and M. Böhme. A comparative study of fuzzers and static analysis tools for finding memory unsafety in C and C++. arXiv preprint arXiv:2505.22052, 2025.
  16. [16] S. Jain, A. Dora, K. S. Sam, and P. Singh. LLM agents improve semantic code search. arXiv preprint arXiv:2408.11058, 2024.
  17. [17] H. Jordan, B. Scholz, and P. Subotić. Soufflé: On synthesis of program analyzers. In Computer Aided Verification: 28th International Conference, CAV 2016, Toronto, ON, Canada, July 17–23, 2016, Proceedings, Part II, pages 422–430. Springer, 2016.
  18. [18] S. Kang, J. Yoon, N. Askarbekkyzy, and S. Yoo. Evaluating diverse large language models for automatic and general bug reproduction. IEEE Transactions on Software Engineering, 2024.
  19. [19] G. Li, M. Sridharan, and Z. Qian. Redefining indirect call analysis with KallGraph. In 2025 IEEE Symposium on Security and Privacy (SP), pages 2957–2975. IEEE, 2025.
  20. [20] L. Li, T. F. Bissyandé, M. Papadakis, S. Rasthofer, A. Bartel, D. Octeau, J. Klein, and Y. Le Traon. Static analysis of Android apps: A systematic literature review. Information and Software Technology, 88:67–95, 2017.
  21. [21] X. Lin, Y. Ning, J. Zhang, Y. Dong, Y. Liu, Y. Wu, X. Qi, N. Sun, Y. Shang, K. Wang, et al. LLM-based agents suffer from hallucinations: A survey of taxonomy, methods, and directions. arXiv preprint arXiv:2509.18970, 2025.
  22. [22] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
  23. [23] A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
  24. [24] J. Liu, C. S. Xia, Y. Wang, and L. Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36:21558–21572, 2023.
  25. [25] K. Lu. Practical program modularization with type-based dependence analysis. In 2023 IEEE Symposium on Security and Privacy (SP), pages 1256–1270. IEEE, 2023.
  26. [26] K. Lu and H. Hu. Where does it go? Refining indirect-call targets with multi-layer type analysis. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 1867–1881, 2019.
  27. [27] W. Lv, X. Xia, and S.-J. Huang. CodeACT: Code adaptive compute-efficient tuning framework for code LLMs. arXiv preprint arXiv:2408.02193, 2024.
  28. [28] X. Mei, P. S. Singaria, J. Del Castillo, H. Xi, T. Bao, R. Wang, Y. Shoshitaishvili, A. Doupé, H. Pearce, B. Dolan-Gavitt, et al. ARVO: Atlas of reproducible vulnerabilities for open source software. arXiv preprint arXiv:2408.02153, 2024.
  29. [29] V. Nitin, B. Ray, and R. Z. Moghaddam. FaultLine: Automated proof-of-vulnerability generation using LLM agents. arXiv preprint arXiv:2507.15241, 2025.
  30. [30] OpenAI. OpenAI Codex CLI: Lightweight coding agent for the terminal. https://github.com/openai/codex, 2025. Accessed: 2025-05-10.
  31. [31] C. S. Păsăreanu and N. Rungta. Symbolic PathFinder: Symbolic execution of Java bytecode. In Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering, pages 179–180, 2010.
  32. [32] D. Pomian, A. Bellur, M. Dilhara, Z. Kurbatova, E. Bogomolov, T. Bryksin, and D. Dig. Next-generation refactoring: Combining LLM insights and IDE capabilities for extract method. In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 275–287. IEEE, 2024.
  33. [33] G. Ryan, S. Jain, M. Shang, S. Wang, X. Ma, M. K. Ramanathan, and B. Ray. Code-aware prompting: A study of coverage-guided test generation in regression setting using LLM. Proceedings of the ACM on Software Engineering, 1(FSE):951–971, 2024.
  34. [34] S. Saha, L. Sarker, M. Shafiuzzaman, C. Shou, A. Li, G. Sankaran, and T. Bultan. Rare path guided fuzzing. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1295–1306, 2023.
  35. [35] K. Sen, D. Marinov, and G. Agha. CUTE: A concolic unit testing engine for C. ACM SIGSOFT Software Engineering Notes, 30(5):263–272, 2005.
  36. [36] M. Shafiuzzaman, A. Desai, L. Sarker, and T. Bultan. STASE: Static analysis guided symbolic execution for UEFI vulnerability signature generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 1783–1794, 2024.
  37. [37] B. Shastry, M. Leutner, T. Fiebig, K. Thimmaraju, F. Yamaguchi, K. Rieck, S. Schmid, J.-P. Seifert, and A. Feldmann. Static program analysis as a fuzzing aid. In International Symposium on Research in Attacks, Intrusions, and Defenses, pages 26–47. Springer, 2017.
  38. [38] D. Simsek, A. Eghbali, and M. Pradel. PoCGen: Generating proof-of-concept exploits for vulnerabilities in npm packages. arXiv preprint arXiv:2506.04962, 2025.
  39. [39] X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
  40. [40] Z. Wang, T. Shi, J. He, M. Cai, J. Zhang, and D. Song. CyberGym: Evaluating AI agents' cybersecurity capabilities with real-world vulnerabilities at scale. arXiv preprint arXiv:2506.02548, 2025.
  41. [41] V. Wüstholz and M. Christakis. Targeted greybox fuzzing with static lookahead analysis. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pages 789–800, 2020.
  42. [42] C. S. Xia, M. Paltenghi, J. Le Tian, M. Pradel, and L. Zhang. Fuzz4All: Universal fuzzing with large language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.
  43. [43] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.
  44. [44] W. Zhang, Y. Sun, P. Huang, J. Pu, H. Lin, and D. Song. MIRAGE-Bench: LLM agent is hallucinating and where to find them. arXiv preprint arXiv:2507.21017, 2025.
  45. [45] Y. Zheng, Z. Song, Y. Sun, K. Cheng, H. Zhu, and L. Sun. An efficient greybox fuzzing scheme for Linux-based IoT programs through binary static analysis. In 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC), pages 1–8. IEEE, 2019.