Contextualizing Sink Knowledge for Java Vulnerability Discovery

Cen Zhang; Fabian Fleischer; Jeongin Cho; Joonun Jang; Meng Xu; Taesoo Kim

arxiv: 2604.01645 · v3 · submitted 2026-04-02 · 💻 cs.CR

Pith reviewed 2026-05-13 21:45 UTC · model grok-4.3

classification 💻 cs.CR

keywords Java vulnerability discovery · sink-centric fuzzing · CWE scanning · LLM-assisted filtering · exploration agent · exploitation agent · proof-of-concept exploits · coverage-guided fuzzer

The pith

GONDAR discovers four times more Java vulnerabilities than Jazzer by targeting sink APIs with LLM filtering and collaborative agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a fuzzing method that focuses on sinks, the security-sensitive API calls in Java code. Sinks carry two kinds of information: the program-specific constraints that must hold to reach them and the conditions that must hold to trigger an exploit. GONDAR first locates reachable and exploitable sink call sites using CWE-specific scans plus LLM-assisted static filtering, then runs two agents alongside a coverage-guided fuzzer. One agent solves path constraints to reach the sinks while the other builds working proof-of-concept exploits. The components continuously exchange seeds and runtime feedback.

Core claim

GONDAR is a sink-centric fuzzing framework that systematically leverages sink API semantics for targeted vulnerability discovery. It identifies reachable and exploitable sink call sites through CWE-specific scanning combined with LLM-assisted static filtering, then deploys an exploration agent to generate inputs that reach target call sites by solving path constraints and an exploitation agent to synthesize proof-of-concept exploits by satisfying vulnerability-triggering conditions. The agents and fuzzer continuously exchange seeds and runtime feedback.
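The two-stage target selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GONDAR's implementation: the `CWE_SINKS` table, the `CallSite` record, and `toy_judge` (a stand-in for the LLM-assisted static filter) are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical CWE-to-sink table; a real scanner would cover far more APIs.
CWE_SINKS = {
    "CWE-78": {"java.lang.ProcessBuilder.<init>", "java.lang.Runtime.exec"},
    "CWE-22": {"java.io.FileInputStream.<init>"},
    "CWE-502": {"java.io.ObjectInputStream.readObject"},
}


@dataclass(frozen=True)
class CallSite:
    caller: str  # method that contains the call
    callee: str  # fully qualified API being invoked
    line: int


def scan_sinks(call_sites: Iterable[CallSite]) -> List[Tuple[str, CallSite]]:
    """Stage 1: tag every call site whose callee is a known sink for some CWE."""
    hits = []
    for site in call_sites:
        for cwe, apis in CWE_SINKS.items():
            if site.callee in apis:
                hits.append((cwe, site))
    return hits


def filter_sinks(hits, judge: Callable[[str, CallSite], bool]):
    """Stage 2: keep only candidates the judge deems reachable and exploitable."""
    return [(cwe, site) for cwe, site in hits if judge(cwe, site)]


def toy_judge(cwe: str, site: CallSite) -> bool:
    # Placeholder heuristic, NOT the paper's filter: discard sinks in test code.
    return "Test" not in site.caller


sites = [
    CallSite("app.Handler.run", "java.lang.ProcessBuilder.<init>", 20),
    CallSite("app.HandlerTest.setUp", "java.lang.Runtime.exec", 8),
    CallSite("app.Util.log", "java.lang.String.format", 5),
]
candidates = scan_sinks(sites)                # two CWE-78 candidates
targets = filter_sinks(candidates, toy_judge) # only the non-test call site survives
```

Keeping the judge as a plain callable makes the filter swappable, which is also what an ablation of the filtering stage would need.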

What carries the argument

Two specialized agents—an exploration agent that solves path constraints to reach sink call sites and an exploitation agent that reasons about vulnerability conditions—working collaboratively with a coverage-guided fuzzer.
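A toy model of this collaboration may help fix ideas. The sketch below is an assumption-laden illustration, not the paper's protocol: `MAGIC`, `PAYLOAD`, `run_target`, and the three proposal functions are hypothetical, and the "agents" are hard-coded rules rather than LLM calls. It only shows the shape of the loop, in which all three components read shared feedback and contribute seeds to one pool.

```python
import random

MAGIC = b"KEY:"    # hypothetical path constraint guarding the sink
PAYLOAD = b";id"   # hypothetical exploit-triggering substring


def run_target(data: bytes) -> str:
    """Toy target: report the deepest stage the input reaches."""
    if not data.startswith(MAGIC):
        return "miss"
    return "exploited" if PAYLOAD in data else "reached"


def fuzzer_step(pool, rng):
    """Fuzzer stand-in: overwrite one random byte of a random seed."""
    seed = bytearray(rng.choice(sorted(pool)))
    seed[rng.randrange(len(seed))] = rng.randrange(256)
    return bytes(seed)


def exploration_agent(feedback):
    """Propose a constraint-satisfying input while no seed has reached the sink."""
    reached = any(status != "miss" for status in feedback.values())
    return None if reached else MAGIC + b"ls"


def exploitation_agent(pool, feedback):
    """Extend a sink-reaching seed with an exploit payload."""
    for seed in sorted(pool):
        if feedback.get(seed) == "reached":
            return seed + PAYLOAD
    return None


def campaign(rounds=5, rng_seed=0):
    rng = random.Random(rng_seed)
    pool = {b"hello"}
    feedback = {b"hello": run_target(b"hello")}  # shared runtime feedback
    for _ in range(rounds):
        proposals = [
            fuzzer_step(pool, rng),
            exploration_agent(feedback),
            exploitation_agent(pool, feedback),
        ]
        for cand in proposals:
            if cand is None or cand in feedback:
                continue
            status = run_target(cand)
            is_new_stage = status not in feedback.values()
            feedback[cand] = status
            if is_new_stage:          # crude stand-in for "new coverage"
                pool.add(cand)
        if "exploited" in feedback.values():
            break
    return pool, feedback


pool, feedback = campaign()
```

In this toy, the single-byte mutator cannot construct the four-byte prefix on its own, so the sink is reached only once the exploration rule contributes a seed, and the exploitation rule then builds on that seed in the next round.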

Load-bearing premise

LLM-assisted static filtering combined with CWE scanning can reliably identify reachable and exploitable sink call sites without excessive false positives or missed targets.

What would settle it

Running GONDAR and Jazzer side-by-side on the same real-world Java benchmarks and counting the distinct vulnerabilities each finds; a ratio close to one would falsify the four-times improvement claim.
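The comparison reduces to a ratio over sets of distinct findings. A minimal sketch, assuming each tool's output has already been deduplicated into vulnerability identifiers; the identifiers below are placeholders, not results from the paper, and `discovery_ratio` is a hypothetical helper name.

```python
def discovery_ratio(found_tool: set, found_baseline: set) -> float:
    """Distinct vulnerabilities found by the tool, divided by the baseline's count."""
    if not found_baseline:
        raise ValueError("baseline found nothing; the ratio is undefined")
    return len(found_tool) / len(found_baseline)


# Placeholder identifiers for illustration only.
tool_findings = {"vuln-01", "vuln-02", "vuln-03", "vuln-04"}
baseline_findings = {"vuln-02"}

ratio = discovery_ratio(tool_findings, baseline_findings)
only_tool = tool_findings - baseline_findings  # findings the baseline missed
```

Reporting the set difference alongside the ratio shows whether the tool finds a superset of the baseline's vulnerabilities or a different set altogether.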

Figures

Figures reproduced from arXiv: 2604.01645 by Cen Zhang, Fabian Fleischer, Jeongin Cho, Joonun Jang, Meng Xu, Taesoo Kim.

**Figure 1.** Command injection vulnerability in Jenkins from AIxCC semifinal exemplar. The vulnerability requires satisfying multiple conditions to reach the ProcessBuilder sink (line 20) and specific input properties to trigger exploitation (line 23).

Excerpt from the adjacent text: ③ Complex exploitation logic (18/21): this represents the most common failure mode. Exploitation required reasoning about intricate conditions, multiple API inte…

**Figure 2.** Overall design of GONDAR. Robot icons indicate LLM-based components.

Excerpt from the adjacent text: … value must match the SHA-256 hash of "breakin the law". When both conditions hold, execution invokes createUtils, which constructs a ProcessBuilder with attacker-controlled arguments. The ProcessBuilder constructor at line 20 is the sink, i.e., a security-sensitive API where attacker-controllable arguments enable command execution. Jazzer…

**Figure 3.** Coordinate diagram showing the relationship between vulnerabilities reached and exploited for different tools and configurations.

Excerpt from the adjacent text: … retains 52 out of 54 expected vulnerabilities; two false negatives stem from Joern's inability to resolve reflective calls and lambdas, a general limitation of Java static analysis. LLM cost scales with analyzed sinks, with command injection most expensive due to iterative pay…

**Figure 4.** Spider chart showing normalized reached and exploited vulnerabilities per CWE type for baseline settings and different configurations of GONDAR. The value at each axis is calculated as the number of vulnerabilities reached/exploited divided by the total number of vulnerabilities for that CWE type.

Excerpt from the adjacent text: Fuzzing Cost vs. LLM Cost. Since both large-scale fuzzing and LLM usage can incur significant costs, we analyz…

**Figure 5.** Per-vulnerability matrix showing reached (light) and exploited (dark) status for each of the 54 vulnerabilities across all configurations.

Excerpt from the adjacent text: Takeaway 3: Open-weight models achieve near-flagship effectiveness at a fraction of the cost: GLM-5 exploits 35 vulnerabilities at $392 (8× cheaper than flagships), offering a viable option for cost-sensitive deployments or privacy-constrained environments that requir…

**Figure 6.** Jazzer coverage over time per project.
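The per-axis value described in the Figure 4 caption is a simple per-CWE ratio. A small sketch, with made-up counts that are not the paper's data; `normalize_per_cwe` is a hypothetical helper name.

```python
def normalize_per_cwe(counts: dict, totals: dict) -> dict:
    """Fraction of each CWE type's vulnerabilities that were reached (or exploited)."""
    return {
        cwe: counts.get(cwe, 0) / total
        for cwe, total in totals.items()
        if total > 0  # skip CWE types with no known vulnerabilities
    }


# Made-up counts for illustration only.
totals = {"CWE-78": 10, "CWE-22": 8, "CWE-502": 4}
reached = {"CWE-78": 5, "CWE-22": 8}

axes = normalize_per_cwe(reached, totals)
```

Normalizing per CWE type keeps a category with many vulnerabilities from dominating the chart, at the cost of hiding absolute counts.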

Original abstract

Java applications are prone to vulnerabilities stemming from the insecure use of security-sensitive APIs, such as file operations enabling path traversal or deserialization routines allowing remote code execution. These sink APIs encode critical information for vulnerability discovery: the program-specific constraints required to reach them and the exploitation conditions necessary to trigger security flaws. Despite this, existing fuzzers largely overlook such vulnerability-specific knowledge, limiting their effectiveness. We present GONDAR, a sink-centric fuzzing framework that systematically leverages sink API semantics for targeted vulnerability discovery. GONDAR first identifies reachable and exploitable sink call sites through CWE-specific scanning combined with LLM-assisted static filtering. It then deploys two specialized agents that work collaboratively with a coverage-guided fuzzer: an exploration agent generates inputs to reach target call sites by iteratively solving path constraints, while an exploitation agent synthesizes proof-of-concept exploits by reasoning about and satisfying vulnerability-triggering conditions. The agents and fuzzer continuously exchange seeds and runtime feedback, complementing each other. We evaluated GONDAR on real-world Java benchmarks, where it discovers four times more vulnerabilities than Jazzer, the state-of-the-art Java fuzzer. Notably, an earlier GONDAR version contributed to Team Atlanta's first-place CRS in the DARPA AI Cyber Challenge, and is integrated into OSS-CRS, a sandbox project in The Linux Foundation's OpenSSF, to analyze open-source Java projects, where it has already uncovered a zero-day vulnerability.
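The Jenkins example in Figures 1 and 2 illustrates why reaching a sink can be hard for a mutation-based fuzzer: one guard compares an input against the SHA-256 hash of "breakin the law". The Python sketch below models that kind of guard. It is a toy under stated assumptions, not the Jenkins code: the header name `x-backdoor`, the presence check as the first condition, and the reading that the supplied value must equal the hex digest are all hypothetical, and the sink is replaced by a list that records would-be commands instead of running them.

```python
import hashlib

# Hex digest that the guarded value is compared against.
EXPECTED = hashlib.sha256(b"breakin the law").hexdigest()

executed = []  # records would-be commands; nothing is actually run


def handle_request(headers: dict, command: str) -> bool:
    """Toy handler: return True when the stand-in sink is reached."""
    key = headers.get("x-backdoor")   # hypothetical header name
    if key is None:                   # condition 1 (assumed): header is present
        return False
    if key != EXPECTED:               # condition 2: value matches the hash
        return False
    executed.append(command.split())  # stand-in for constructing a ProcessBuilder
    return True


blind_guess = handle_request({"x-backdoor": "guess"}, "id")
informed = handle_request({"x-backdoor": EXPECTED}, "id")
```

A byte-level mutator would have to produce all 64 hex characters of the digest by chance, whereas a component that can read the comparison in the source can compute the value directly.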

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

GONDAR's dual-agent sink targeting plus LLM filtering is a reasonable engineering step forward for Java fuzzing, but the 4x claim and zero-day rest on an unmeasured filtering stage that the abstract leaves unverified.

The paper introduces GONDAR, a framework that first scans for CWE-related sinks, uses an LLM to filter reachable and exploitable ones, then runs an exploration agent to reach them and an exploitation agent to craft triggers, all feeding back into a coverage fuzzer. That two-agent split with seed exchange is the concrete new piece; prior work has used LLMs or constraint solving separately, but the collaborative loop around sink knowledge is not in the cited baselines. The integration into OSS-CRS and the DARPA CRS win show the system runs on real code and has produced at least one zero-day, which is useful evidence of practicality.

The evaluation claims four times more vulnerabilities than Jazzer on real-world Java benchmarks. That number is the headline result, yet the abstract gives no precision or recall for the LLM filtering step, no manual audit of the selected sinks, and no breakdown of how vulnerabilities were counted or which benchmarks were used. Without those numbers the improvement cannot be cleanly attributed to the sink-centric design rather than to better seed selection or other unstated factors. The stress-test concern holds on the supplied text: the central claim depends on an unquantified early stage. If the full paper supplies a ground-truth sink audit or an ablation that isolates the filtering contribution, the result strengthens; otherwise the 4x figure stays conditional.

This work is aimed at researchers and practitioners building automated security tools for Java. It is worth sending to peer review so referees can check the implementation details, the exact benchmark set, and whether the filtering metrics support the performance claims. The architecture is clear enough that a serious review would be productive even if revisions are needed on the evaluation.

Referee Report

2 major / 2 minor

Summary. The paper presents GONDAR, a sink-centric fuzzing framework for Java vulnerability discovery. It first identifies reachable and exploitable sink call sites via CWE-specific scanning plus LLM-assisted static filtering, then deploys an exploration agent (to reach targets by solving path constraints) and an exploitation agent (to synthesize PoC exploits by satisfying vulnerability conditions) that collaborate with a coverage-guided fuzzer through seed and feedback exchange. On real-world Java benchmarks the framework is reported to discover four times more vulnerabilities than Jazzer; an earlier version contributed to a first-place DARPA AI Cyber Challenge result and has been integrated into OSS-CRS where it found a zero-day.

Significance. If the headline performance claims are substantiated with proper controls and metrics, the work would demonstrate that explicit sink-semantic knowledge can materially improve directed fuzzing effectiveness over general-purpose tools, with immediate practical value shown by the DARPA placement and zero-day discovery. The agent-based decomposition of reachability and exploit synthesis is a concrete engineering contribution that could be adopted or extended by other fuzzing pipelines.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: the central claim of a 4× improvement over Jazzer is presented without any description of benchmark selection criteria, vulnerability counting methodology, statistical significance testing, or comparison controls. Because the headline result is load-bearing for the paper’s contribution, these omissions prevent verification that the gains are attributable to sink-centric targeting rather than other factors.
[Approach / Sink Identification] Sink-identification pipeline (described in the approach section): no precision, recall, or false-positive rate is reported for the CWE-specific scanning combined with LLM-assisted static filtering. This step is the prerequisite that supplies targets to both agents; without quantified reliability on the same benchmarks used for the 4× claim, the attribution of improved vulnerability discovery to “contextualizing sink knowledge” remains unverified.

minor comments (2)

[Approach] The interaction protocol between the exploration agent, exploitation agent, and the underlying fuzzer is described at a high level; a diagram or pseudocode listing the seed-exchange and feedback loop would improve clarity.
[Abstract / Evaluation] The abstract states that GONDAR “has already uncovered a zero-day,” but the main text does not appear to provide a case study or CVE reference for that finding; adding a brief description would strengthen the practical-impact claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and constructive suggestions. Below we provide point-by-point responses to the major comments. We will revise the manuscript to incorporate additional details and evaluations as outlined.

Referee: [Abstract / Evaluation] Abstract and Evaluation section: the central claim of a 4× improvement over Jazzer is presented without any description of benchmark selection criteria, vulnerability counting methodology, statistical significance testing, or comparison controls. Because the headline result is load-bearing for the paper’s contribution, these omissions prevent verification that the gains are attributable to sink-centric targeting rather than other factors.

Authors: We acknowledge that the abstract is concise and does not include these methodological details. The Evaluation section of the manuscript describes the benchmarks as real-world Java applications and reports the number of vulnerabilities discovered, but we agree that more explicit criteria and methodology would strengthen the claims. In the revised manuscript, we will expand the Evaluation section with a subsection on 'Benchmark and Evaluation Methodology' that details: (1) the benchmark selection criteria (e.g., open-source projects with known security issues, varying in size from 10k to 100k LOC), (2) the vulnerability counting method, namely the number of unique, reproducible security violations (confirmed via stack traces or CVE matching) found within a 24-hour budget per tool, (3) a note that statistical significance testing was omitted because repeated experiments are costly, although results were consistent across runs, and (4) additional controls, including runs of Jazzer augmented with our identified sinks. This will better substantiate that the improvements stem from the sink-centric approach. revision: yes

Referee: [Approach / Sink Identification] Sink-identification pipeline (described in the approach section): no precision, recall, or false-positive rate is reported for the CWE-specific scanning combined with LLM-assisted static filtering. This step is the prerequisite that supplies targets to both agents; without quantified reliability on the same benchmarks used for the 4× claim, the attribution of improved vulnerability discovery to “contextualizing sink knowledge” remains unverified.

Authors: We agree that empirical validation of the sink identification pipeline is crucial. The Approach section details the CWE-specific scanning rules and the LLM-assisted filtering process, including the prompts used. However, we did not report precision and recall because establishing ground truth for sink reachability and exploitability across the entire benchmark suite would require extensive manual effort. For the revision, we will add an 'Accuracy of Sink Identification' subsection in the Evaluation, where we select a random sample of 100 methods from the benchmarks, manually determine reachable and exploitable sinks, and compute precision, recall, and F1 scores for the CWE scanner alone and the combined pipeline. This will provide the necessary quantification and support the attribution of performance gains to our sink knowledge contextualization. revision: yes
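
The precision, recall, and F1 computation proposed in this response can be sketched as follows. This is a generic illustration, assuming sinks are compared as sets of identifiers against a manually labeled ground truth; the identifiers and the resulting numbers are invented for the example and say nothing about GONDAR's actual accuracy.

```python
def precision_recall_f1(predicted: set, truth: set):
    """Standard set-based metrics for a sink-identification stage."""
    tp = len(predicted & truth)   # sinks correctly selected
    fp = len(predicted - truth)   # selected but not truly reachable and exploitable
    fn = len(truth - predicted)   # true targets that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented labels for illustration only.
truth = {"s1", "s2", "s3", "s4"}
scanner_only = {"s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"}
combined = {"s1", "s2", "s3", "s5"}

scanner_metrics = precision_recall_f1(scanner_only, truth)
combined_metrics = precision_recall_f1(combined, truth)
```

Computing the metrics for the scanner alone and for the combined pipeline, as the response proposes, shows how much precision the filter buys and how much recall it costs.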

Circularity Check

0 steps flagged

No circularity: engineering framework evaluated against external baseline

The paper describes GONDAR as a practical fuzzing pipeline that identifies sinks via CWE scanning plus LLM-assisted static filtering, then deploys exploration and exploitation agents alongside a coverage-guided fuzzer. No equations, fitted parameters, or self-referential derivations appear in the provided text. The headline performance claim (4x vulnerabilities vs. Jazzer plus a zero-day) is measured directly against an external tool on real-world benchmarks, making the result falsifiable outside any internal construction. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The pipeline steps are presented as implementation choices whose effectiveness is assessed empirically rather than by definition or renaming of prior results. This satisfies the criteria for a self-contained artifact with no detectable circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that LLM static filtering produces high-quality sink targets and that the agents can solve path constraints and vulnerability conditions in practice; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption: LLM-assisted static filtering can accurately distinguish reachable and exploitable sink call sites from false positives
Invoked in the first step of GONDAR to select targets for the agents

invented entities (2)

Exploration agent (no independent evidence)
purpose: Iteratively solve path constraints to reach target sink call sites
New component introduced to complement the coverage-guided fuzzer

Exploitation agent (no independent evidence)
purpose: Synthesize proof-of-concept exploits by satisfying vulnerability-triggering conditions
New component introduced to complement the coverage-guided fuzzer

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

[1]

Apache Log4j vulnerability guidance

CISA, “Apache Log4j vulnerability guidance.” https://www.cisa.gov /news-events/news/apache-log4j-vulnerability-guidance, 2021

work page 2021
[2]

Jazzer: Coverage-guided, in-process fuzzing for the JVM

Code Intelligence, “Jazzer: Coverage-guided, in-process fuzzing for the JVM.” https://github.com/CodeIntelligenceTesting/jazzer, 2021

work page 2021
[3]

JQF: Coverage-guided property- based testing in Java,

R. Padhye, C. Lemieux, and K. Sen, “JQF: Coverage-guided property- based testing in Java,” inProceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, (New York, NY , USA), p. 398–401, Association for Computing Machinery, 2019

work page 2019
[4]

FUGIO: Automatic exploit generation for PHP object injection vulnerabilities,

S. Park, D. Kim, S. Jana, and S. Son, “FUGIO: Automatic exploit generation for PHP object injection vulnerabilities,” in31st USENIX Security Symposium (USENIX Security 22), (Boston, MA), USENIX Association, 2022

work page 2022
[5]

ODDFuzz: Discovering Java deserialization vulnerabilities via structure-aware directed greybox fuzzing,

S. Cao, B. He, X. Sun, Y . Ouyang, C. Zhang, X. Wu, T. Su, L. Bo, B. Li, C. Ma, J. Li, and T. Wei, “ODDFuzz: Discovering Java deserialization vulnerabilities via structure-aware directed greybox fuzzing,” in2023 IEEE Symposium on Security and Privacy (SP), (Los Alamitos, CA, USA), pp. 2726–2743, IEEE Computer Society, May 2023

work page 2023
[6]

Where URLs become weapons: Automated discovery of SSRF vulnerabilities in web applications,

E. Wang, J. Chen, W. Xie, C. Wang, Y . Gao, Z. Wang, H. Duan, Y . Liu, and B. Wang, “Where URLs become weapons: Automated discovery of SSRF vulnerabilities in web applications,” in2024 IEEE Symposium on Security and Privacy (SP), pp. 239–257, 2024

work page 2024
[7]

Remote code execution from SSTI in the sandbox: Automatically detecting and exploiting template escape bugs,

Y . Zhao, Y . Zhang, and M. Yang, “Remote code execution from SSTI in the sandbox: Automatically detecting and exploiting template escape bugs,” inProceedings of the 32nd USENIX Conference on Security Symposium, SEC ’23, (USA), USENIX Association, 2023

work page 2023
[8]

Efficient detection of Java deserialization gadget chains via bottom-up gadget search and dataflow-aided payload construction,

B. Chen, L. Zhang, X. Huang, Y . Cao, K. Lian, Y . Zhang, and M. Yang, “Efficient detection of Java deserialization gadget chains via bottom-up gadget search and dataflow-aided payload construction,” in 2024 IEEE Symposium on Security and Privacy (SP), pp. 3961–3978, 2024

work page 2024
[9]

Huang, L

X. Huang, L. Zhang, Y . Liu, P. Deng, Y . Cao, Y . Zhang, and M. Yang, Towards automatic detection and exploitation of Java web application vulnerabilities via concolic execution guided by cross-thread object manipulation. USA: USENIX Association, 2025

work page 2025
[10]

Z. Lin, Y . Zhang, J. Dai, X. Huang, B. Xiang, G. Yang, L. Yuan, L. Zhang, F. Liu, T. Chen, and M. Yang,Effective directed fuzzing with hierarchical scheduling for web vulnerability detection. USA: USENIX Association, 2025

work page 2025
[11]

Toss a fault to your Witcher: Applying grey-box coverage-guided mutational fuzzing to detect SQL and command injection vulnerabilities,

E. Trickel, F. Pagani, C. Zhu, L. Dresel, G. Vigna, C. Kruegel, R. Wang, T. Bao, Y . Shoshitaishvili, and A. Doup ´e, “Toss a fault to your Witcher: Applying grey-box coverage-guided mutational fuzzing to detect SQL and command injection vulnerabilities,” in2023 IEEE Symposium on Security and Privacy (SP), pp. 2658–2675, 2023

work page 2023
[12]

Atropos: Effective fuzzing of web applications for server-side vulnerabilities,

E. G ¨uler, S. Schumilo, M. Schloegel, N. Bars, P. G ¨orz, X. Xu, C. Kaygusuz, and T. Holz, “Atropos: Effective fuzzing of web applications for server-side vulnerabilities,” inProceedings of the 33rd USENIX Conference on Security Symposium, SEC ’24, (USA), USENIX Association, 2024

work page 2024
[13]

Predator: Directed web application fuzzing for efficient vulnerability validation,

C. Wang, W. Meng, C. Luo, and P. Li, “Predator: Directed web application fuzzing for efficient vulnerability validation,” in2025 IEEE Symposium on Security and Privacy (SP), pp. 886–902, 2025

work page 2025
[14]

Semantic code analysis engine

GitHub, “CodeQL.” https://github.com/github/codeql. Semantic code analysis engine

work page
[15]

OSS-CRS: Open Source Cyber Reasoning System

OpenSSF, “OSS-CRS: Open Source Cyber Reasoning System.” https: //github.com/ossf/oss-crs, 2025

work page 2025
[16]

Jazzer sanitizers

Code Intelligence, “Jazzer sanitizers.” https://github.com/CodeIntelli genceTesting/jazzer/tree/main/sanitizers/src/main/java/com/code int elligence/jazzer/sanitizers, 2021

work page 2021
[17]

libFuzzer – value profile

LLVM Project, “libFuzzer – value profile.” https://llvm.org/docs/Li bFuzzer.html#value-profile. Accessed: 2025

work page 2025
[18]

OSS-Fuzz: Continuous fuzzing for open source software

Google, “OSS-Fuzz: Continuous fuzzing for open source software.” https://github.com/google/oss-fuzz, 2016

work page 2016
[19]

LangChain

H. Chase, “LangChain.” https://github.com/langchain-ai/langchain, Oct. 2022

work page 2022
[20]

LangGraph

LangChain AI, “LangGraph.” https://github.com/langchain-ai/langg raph, 2024

work page 2024
[21]

Joern: The bug hunter’s workbench

joern.io, “Joern: The bug hunter’s workbench.” https://github.com/j oernio/joern, 2024

work page 2024
[22]

Optimization of object-oriented programs using static class hierarchy analysis,

J. Dean, D. Grove, and C. Chambers, “Optimization of object-oriented programs using static class hierarchy analysis,” inECOOP’95 — Object-Oriented Programming, 9th European Conference, ˚Aarhus, Denmark, August 7–11, 1995(M. Tokoro and R. Pareschi, eds.), (Berlin, Heidelberg), pp. 77–101, Springer Berlin Heidelberg, 1995

work page 1995
[23]

Fast static analysis of C++ virtual function calls,

D. F. Bacon and P. F. Sweeney, “Fast static analysis of C++ virtual function calls,”SIGPLAN Not., vol. 31, p. 324–341, Oct. 1996

work page 1996
[24]

IRIS: LLM-assisted static analysis for detecting security vulnerabilities,

Z. Li, S. Dutta, and M. Naik, “IRIS: LLM-assisted static analysis for detecting security vulnerabilities,” inInternational Conference on Learning Representations(Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, eds.), vol. 2025, pp. 35735–35758, 2025

work page 2025
[25]

DARPA AI Cyber Challenge (AIxCC)

Defense Advanced Research Projects Agency (DARPA), “DARPA AI Cyber Challenge (AIxCC).” https://aicyberchallenge.com/, 2023. Accessed: 2025-11-13

work page 2023
[26]

Amazon EC2 on-demand pricing

Amazon Web Services, “Amazon EC2 on-demand pricing.” https: //aws.amazon.com/ec2/pricing/on-demand/. Accessed: 2025-11-13

work page 2025
[27]

SoK: DARPA’s AI Cyber Challenge (AIxCC): Competition Design, Archi- tectures, and Lessons Learned,

C. Zhang, Y . Park, F. Fleischer, Y .-F. Fu, J. Kim, D. Kim, Y . Kim, Q. Xu, A. Chin, Z. Sheng, H. Zhao, B. J. Lee, J. Wang, M. Pel- ican, D. J. Musliner, J. Huang, J. Silliman, M. Mcdaniel, J. Casa- vant, I. Goldthwaite, N. Vidovich, M. Lehman, and T. Kim, “SoK: DARPA’s AI Cyber Challenge (AIxCC): Competition Design, Archi- tectures, and Lessons Learned,” 2026

work page 2026
[28]

OSS- Fuzz-Gen: Automated fuzz target generation

D. Liu, O. Chang, J. Metzman, M. Sablotny, and M. Maruseac, “OSS- Fuzz-Gen: Automated fuzz target generation.” https://github.com/goo gle/oss-fuzz-gen, 2024

work page 2024
[29]

ysoserial: A proof-of-concept tool for generating payloads that exploit unsafe Java object deserializa- tion

C. Frohoff, G. Lawrence,et al., “ysoserial: A proof-of-concept tool for generating payloads that exploit unsafe Java object deserializa- tion.” https://github.com/frohoff/ysoserial, 2015

work page 2015
[30]

recheck: The trustworthy ReDoS checker

Makenowjust Labs, “recheck: The trustworthy ReDoS checker.” https: //makenowjust-labs.github.io/recheck/. Accessed: 2025

work page 2025
[31]

Atheris: A coverage-guided, native Python fuzzer

Google, “Atheris: A coverage-guided, native Python fuzzer.” https: //github.com/google/atheris, 2020

work page 2020
[32]

go-fuzz: Randomized testing for Go

D. Vyukov, “go-fuzz: Randomized testing for Go.” https://github.c om/dvyukov/go-fuzz, 2015

work page 2015
[33]

A hybrid analysis to detect Java se- rialisation vulnerabilities,

S. Rasheed and J. Dietrich, “A hybrid analysis to detect Java se- rialisation vulnerabilities,” in2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1209– 1213, 2020

work page 2020
[34]

Improving Java deserialization gadget chain mining via overriding-guided object generation,

S. Cao, X. Sun, X. Wu, L. Bo, B. Li, R. Wu, W. Liu, B. He, Y . Ouyang, and J. Li, “Improving Java deserialization gadget chain mining via overriding-guided object generation,” inProceedings of the 45th International Conference on Software Engineering, ICSE ’23, p. 397–409, IEEE Press, 2023

work page 2023
[35]

Automated discovery of deserialization gadget chains,

I. Haken, “Automated discovery of deserialization gadget chains,” in Black Hat USA, 2018

work page 2018
[36]

Acquirer: A hybrid approach to detecting algorithmic complexity vulnerabilities,

Y . Liu and W. Meng, “Acquirer: A hybrid approach to detecting algorithmic complexity vulnerabilities,” inProceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Secu- rity, CCS ’22, (New York, NY , USA), p. 2071–2084, Association for Computing Machinery, 2022

work page 2022
[37]

HotFuzz: Discovering temporal and spatial denial-of-service vulnerabilities through guided micro-fuzzing,

W. Blair, A. Mambretti, S. Arshad, M. Weissbacher, W. Robertson, E. Kirda, and M. Egele, “HotFuzz: Discovering temporal and spatial denial-of-service vulnerabilities through guided micro-fuzzing,”ACM Trans. Priv. Secur., vol. 25, July 2022

work page 2022
[38]

UChecker: Automatically detecting PHP-based unrestricted file upload vulnerabilities,

J. Huang, Y . Li, J. Zhang, and R. Dai, “UChecker: Automatically detecting PHP-based unrestricted file upload vulnerabilities,” in2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 581–592, 2019

work page 2019
[39]

FUSE: Finding file upload bugs via penetration testing,

T. Lee, S. Wi, S. Lee, and S. Son, “FUSE: Finding file upload bugs via penetration testing,” inNetwork and Distributed System Security Symposium (NDSS), 2020

work page 2020
[40]

Undefined-oriented programming: De- tecting and chaining prototype pollution gadgets in Node.js template engines for malicious consequences,

Z. Liu, K. An, and Y . Cao, “Undefined-oriented programming: De- tecting and chaining prototype pollution gadgets in Node.js template engines for malicious consequences,” in2024 IEEE Symposium on Security and Privacy (SP), pp. 4015–4033, 2024

work page 2024
[41]

Detecting Node.js prototype pollution vulnerabilities via object lookup analysis,

S. Li, M. Kang, J. Hou, and Y . Cao, “Detecting Node.js prototype pollution vulnerabilities via object lookup analysis,” inProceedings of the 29th ACM Joint Meeting on European Software Engineering Con- ference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021, (New York, NY , USA), p. 268–279, Association for Computing Machinery, 2021

work page 2021
[42]

Silent spring: Pro- totype pollution leads to remote code execution in Node.js,

M. Shcherbakov, M. Balliu, and C.-A. Staicu, “Silent spring: Pro- totype pollution leads to remote code execution in Node.js,” in Proceedings of the 32nd USENIX Conference on Security Symposium, SEC ’23, (USA), USENIX Association, 2023

work page 2023
[43]

webFuzz: Grey-box fuzzing for web ap- plications,

O. van Rooij, M. A. Charalambous, D. Kaizer, M. Papaevripides, and E. Athanasopoulos, “webFuzz: Grey-box fuzzing for web ap- plications,” inComputer Security – ESORICS 2021: 26th European Symposium on Research in Computer Security, Darmstadt, Ger- many, October 4–8, 2021, Proceedings, Part I, (Berlin, Heidelberg), p. 152–172, Springer-Verlag, 2021

work page 2021
[44]

Backrest: A model-based feedback-driven greybox fuzzer for web applications,

F. Gauthier, B. Hassanshahi, B. Selwyn-Smith, T. N. Mai, M. Schl¨uter, and M. Williams, “Backrest: A model-based feedback-driven greybox fuzzer for web applications,”arXiv preprint arXiv:2108.08455, 2021

work page arXiv 2021
[45]

Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models,

Y . Deng, C. S. Xia, H. Peng, C. Yang, and L. Zhang, “Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models,” inProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2023, (New York, NY , USA), p. 423–435, Association for Computing Machinery, 2023

work page 2023
[46]

Fuzz4ALL: Universal fuzzing with large language models,

C. S. Xia, M. Paltenghi, J. L. Tian, M. Pradel, and L. Zhang, “Fuzz4ALL: Universal fuzzing with large language models,” in2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), (Los Alamitos, CA, USA), pp. 1547–1559, IEEE Computer Society, Apr. 2024

work page 2024
[47]

CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models,

C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models,” inProceedings of the 45th International Confer- ence on Software Engineering, ICSE ’23, p. 919–931, IEEE Press, 2023

work page 2023
[48]

Harnessing large language models for seed generation in greybox fuzzing,

W. Shi, Y . Zhang, X. Xing, and J. Xu, “Harnessing large language models for seed generation in greybox fuzzing,”arXiv preprint arXiv:2411.18143, 2024

work page arXiv 2024
[49]

C. Chen, B. Dolan-Gavitt, and Z. Lin,ELFUZZ: Efficient input gen- eration via LLM-driven synthesis over fuzzer space. USA: USENIX Association, 2025

work page 2025
[50]

Augmenting greybox fuzzing with generative AI,

J. Hu, Q. Zhang, and H. Yin, “Augmenting greybox fuzzing with generative AI,”arXiv preprint arXiv:2306.06782, 2023

work page arXiv 2023
[51]

R. Meng, M. Mirchev, M. Böhme, and A. Roychoudhury, “Large language model guided protocol fuzzing,” in Proceedings of the 31st Annual Network and Distributed System Security Symposium, NDSS 2024, 2024.

[52]

C. Yang, Y. Deng, R. Lu, J. Yao, J. Liu, R. Jabbarvand, and L. Zhang, “WhiteFox: White-box compiler fuzzing empowered by large language models,” Proc. ACM Program. Lang., vol. 8, Oct. 2024.

[53]

C. Yang, Z. Zhao, and L. Zhang, KernelGPT: Enhanced kernel fuzzing via large language models, pp. 560–573. New York, NY, USA: Association for Computing Machinery, 2025.

[54]

D. Liu, O. Chang, J. Metzman, M. Sablotny, and M. Maruseac, “OSS-Fuzz-Gen: Automated fuzz target generation,” May 2024.

[55]

C. Zhang, Y. Zheng, M. Bai, Y. Li, W. Ma, X. Xie, Y. Li, L. Sun, and Y. Liu, “How effective are they? Exploring large language model based fuzz driver generation,” in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, (New York, NY, USA), pp. 1223–1235, Association for Computing Machinery, 2024.

[56]

Y. Lyu, Y. Xie, P. Chen, and H. Chen, “Prompt fuzzing for fuzz driver generation,” in Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS ’24, (New York, NY, USA), pp. 3793–3807, Association for Computing Machinery, 2024.

Appendix A. Benchmark Contamination Analysis

Table 11 summarizes the composition and contami...

[57]

The paper provides a valuable step forward in an established field via a coherent end-to-end design targeting the reachability–exploitability gap in sink-based vulnerability discovery.

[58]

The paper creates a new tool to enable future science, which provides substantial improvement over the chosen baseline, supported by ablation analysis.

[59]

The new tool offers potential community value if the framework and benchmark are released.

D.4. Noteworthy Concerns

[60]

Internal precision is very low (∼14%), implying many false positives and unclear practical reliability.

[61]

Best performance relies on expensive flagship LLMs that lead to high monetary cost; cheaper models seem to significantly degrade performance. Some cheaper, open-source models may do better in terms of exploitation success, but internal working effectiveness remains overlooked, casting a shadow on performance.

[62]

Comparison is limited to Jazzer, lacking evaluation against state-of-the-art analyzers (e.g., IRIS@ICLR’25, RepoAudit@ICML’25, LLMDFA@NeurIPS’24) and stronger fuzzers (e.g., PolyFuzz@USENIX Security’23). While a set of state-of-the-art industry tools is considered, those academic works are overlooked, leaving the technical advancement unclear.

Appendix E...
