
arxiv: 2604.01645 · v3 · submitted 2026-04-02 · 💻 cs.CR

Contextualizing Sink Knowledge for Java Vulnerability Discovery

Pith reviewed 2026-05-13 21:45 UTC · model grok-4.3

classification 💻 cs.CR
keywords Java vulnerability discovery · sink-centric fuzzing · CWE scanning · LLM-assisted filtering · exploration agent · exploitation agent · proof-of-concept exploits · coverage-guided fuzzer

The pith

GONDAR discovers four times more Java vulnerabilities than Jazzer by targeting sink APIs with LLM filtering and collaborative agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a fuzzing method that focuses on security-sensitive API calls called sinks in Java code. These sinks carry key details about how to reach them and how to trigger exploits. GONDAR first locates reachable sinks using CWE-based scans plus LLM static checks, then runs two agents alongside a coverage fuzzer. One agent solves paths to reach the sinks while the other builds working exploits. The components share seeds and feedback in real time.

Core claim

GONDAR is a sink-centric fuzzing framework that systematically leverages sink API semantics for targeted vulnerability discovery. It identifies reachable and exploitable sink call sites through CWE-specific scanning combined with LLM-assisted static filtering, then deploys an exploration agent to generate inputs that reach target call sites by solving path constraints and an exploitation agent to synthesize proof-of-concept exploits by satisfying vulnerability-triggering conditions. The agents and fuzzer continuously exchange seeds and runtime feedback.

What carries the argument

Two specialized agents working collaboratively with a coverage-guided fuzzer: an exploration agent that solves path constraints to reach sink call sites, and an exploitation agent that reasons about vulnerability-triggering conditions.
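The collaboration described above can be pictured with a toy loop. This is a minimal sketch, not GONDAR's implementation: the `target` program, the `SharedCorpus` structure, the sink id `sink0`, and the two agent functions are all hypothetical stand-ins (the real agents are LLM-based and the real fuzzer is coverage-guided). It only illustrates how seeds and reach/exploit feedback might flow between the three components.

```python
from dataclasses import dataclass, field


@dataclass
class SharedCorpus:
    """Hypothetical seed pool that all three components read and write."""
    seeds: list = field(default_factory=list)
    reached: set = field(default_factory=set)    # sink ids reached so far
    exploited: set = field(default_factory=set)  # sink ids with a working PoC


def target(data: bytes) -> str:
    """Toy program under test: returns 'miss', 'reached', or 'exploited'."""
    if not data.startswith(b"CMD:"):   # path constraint guarding the sink
        return "miss"
    if b";" in data[4:]:               # vulnerability-triggering condition
        return "exploited"
    return "reached"


def fuzzer_step(corpus: SharedCorpus) -> None:
    """Stand-in for the fuzzer: replays seeds and records runtime feedback."""
    for seed in list(corpus.seeds):
        outcome = target(seed)
        if outcome in ("reached", "exploited"):
            corpus.reached.add("sink0")
        if outcome == "exploited":
            corpus.exploited.add("sink0")


def exploration_agent(corpus: SharedCorpus) -> None:
    """Stand-in for the exploration agent: proposes a seed meeting the path constraint."""
    if "sink0" not in corpus.reached:
        corpus.seeds.append(b"CMD:ls")


def exploitation_agent(corpus: SharedCorpus) -> None:
    """Stand-in for the exploitation agent: extends a reaching seed into a PoC."""
    if "sink0" in corpus.reached and "sink0" not in corpus.exploited:
        for seed in list(corpus.seeds):
            if target(seed) == "reached":
                corpus.seeds.append(seed + b";id")
                break


def run(rounds: int = 3) -> SharedCorpus:
    corpus = SharedCorpus(seeds=[b"hello"])
    for _ in range(rounds):
        exploration_agent(corpus)
        exploitation_agent(corpus)
        fuzzer_step(corpus)
    return corpus


final = run()
print(final.reached, final.exploited, len(final.seeds))
```

In this sketch the exploitation agent only acts after the fuzzer has reported that the sink was reached, which mirrors the division of labor the summary describes: reachability first, then exploit synthesis.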

Load-bearing premise

LLM-assisted static filtering combined with CWE scanning can reliably identify reachable and exploitable sink call sites without excessive false positives or missed targets.
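To make the premise concrete, here is a minimal sketch of a two-stage sink selection step. Everything in it is an assumption for illustration: the `SINKS_BY_CWE` mapping is a small hand-picked list of well-known Java APIs rather than GONDAR's actual rule set, and `stub_judge` is a trivial placeholder where the real system would ask an LLM whether a call site is reachable and exploitable.

```python
from typing import Callable, List, NamedTuple, Tuple


class CallSite(NamedTuple):
    file: str
    line: int
    api: str


# Illustrative CWE -> sink API mapping (an assumption, not the paper's list).
SINKS_BY_CWE = {
    "CWE-78": {"java.lang.ProcessBuilder.<init>", "java.lang.Runtime.exec"},
    "CWE-22": {"java.io.File.<init>", "java.nio.file.Files.newInputStream"},
    "CWE-502": {"java.io.ObjectInputStream.readObject"},
}

Candidate = Tuple[str, CallSite]


def cwe_scan(call_sites: List[CallSite]) -> List[Candidate]:
    """Stage 1: keep call sites whose API matches a known sink for some CWE."""
    candidates = []
    for site in call_sites:
        for cwe, apis in SINKS_BY_CWE.items():
            if site.api in apis:
                candidates.append((cwe, site))
    return candidates


def llm_filter(candidates: List[Candidate],
               judge: Callable[[str, CallSite], bool]) -> List[Candidate]:
    """Stage 2: keep only candidates the judge considers worth targeting."""
    return [(cwe, site) for cwe, site in candidates if judge(cwe, site)]


def stub_judge(cwe: str, site: CallSite) -> bool:
    """Placeholder for an LLM verdict: here it merely drops test-only code."""
    return "/src/test/" not in site.file


call_sites = [
    CallSite("app/src/main/java/Utils.java", 20, "java.lang.ProcessBuilder.<init>"),
    CallSite("app/src/test/java/UtilsTest.java", 8, "java.lang.Runtime.exec"),
    CallSite("app/src/main/java/Loader.java", 41, "java.io.ObjectInputStream.readObject"),
    CallSite("app/src/main/java/Fmt.java", 5, "java.lang.String.format"),
]

candidates = cwe_scan(call_sites)
targets = llm_filter(candidates, stub_judge)
print(len(candidates), [cwe for cwe, _ in targets])
```

The premise fails in two directions in this picture: a judge that is too permissive passes false positives on to the agents, while one that is too strict (or a scanner that misses a call site) silently removes real targets.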

What would settle it

Running GONDAR and Jazzer side-by-side on the same real-world Java benchmarks and counting the distinct vulnerabilities each finds; a ratio close to one would falsify the four-times improvement claim.
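The comparison can be sketched as a set computation over deduplicated vulnerability identifiers. The identifiers below are made up for illustration and are not results from the paper; the sketch also assumes "four times more" is read as a ratio of distinct counts.

```python
def compare_tools(found_a, found_b):
    """Compare two tools by the distinct vulnerabilities each reported."""
    a, b = set(found_a), set(found_b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "both": sorted(a & b),
        # Ratio of distinct findings; infinite if the baseline found none.
        "ratio": len(a) / len(b) if b else float("inf"),
    }


# Hypothetical identifiers; duplicates are collapsed before counting.
gondar_ids = ["V1", "V2", "V3", "V4", "V2"]
jazzer_ids = ["V2"]

result = compare_tools(gondar_ids, jazzer_ids)
print(result)
```

Reporting the `only_a`, `only_b`, and `both` sets alongside the ratio shows whether one tool's findings subsume the other's, which a bare ratio hides.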

Figures

Figures reproduced from arXiv: 2604.01645 by Cen Zhang, Fabian Fleischer, Jeongin Cho, Joonun Jang, Meng Xu, Taesoo Kim.

Figure 1: Command injection vulnerability in Jenkins from AIxCC semifinal exemplar. The vulnerability requires satisfying multiple conditions to reach the ProcessBuilder sink (line 20) and specific input properties to trigger exploitation (line 23).
… guide. ③ Complex exploitation logic (18/21): this represents the most common failure mode. Exploitation required reasoning about intricate conditions, multiple API inte…
Figure 2: Overall design of GONDAR. Robot icons indicate LLM-based components.
… value must match the SHA-256 hash of "breakin the law". When both conditions hold, execution invokes createUtils, which constructs a ProcessBuilder with attacker-controlled arguments. The ProcessBuilder constructor at line 20 is the sink, i.e., a security-sensitive API where attacker-controllable arguments enable command execution. Jazzer…
Figure 3: Coordinate diagram showing the relationship between vulnerabilities reached and exploited for different tools and configurations.
… retains 52 out of 54 expected vulnerabilities; two false negatives stem from Joern's inability to resolve reflective calls and lambdas, a general limitation of Java static analysis. LLM cost scales with analyzed sinks, with command injection most expensive due to iterative pay…
Figure 4: Spider chart showing normalized reached and exploited vulnerabilities per CWE type for baseline settings and different configurations of GONDAR. The value at each axis is calculated as the number of vulnerabilities reached/exploited divided by the total number of vulnerabilities for that CWE type.
Fuzzing Cost vs. LLM Cost. Since both large-scale fuzzing and LLM usage can incur significant costs, we analyz…
Figure 5: Per-vulnerability matrix showing reached (light) and exploited (dark) status for each of the 54 vulnerabilities across all configurations.
Takeaway 3: Open-weight models achieve near-flagship effectiveness at a fraction of the cost: GLM-5 exploits 35 vulnerabilities at $392 (8× cheaper than flagships), offering a viable option for cost-sensitive deployments or privacy-constrained environments that requir…
Figure 6: Jazzer coverage over time per project.
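
The normalization stated in the Figure 4 caption (count reached or exploited, divided by the total for that CWE type) can be sketched as follows. The counts are invented for illustration and do not come from the paper; only the formula is taken from the caption.

```python
def normalize_per_cwe(counts, totals):
    """Divide per-CWE counts by per-CWE totals; missing counts are treated as 0."""
    return {
        cwe: counts.get(cwe, 0) / total
        for cwe, total in totals.items()
        if total > 0
    }


# Hypothetical numbers, chosen only to exercise the formula.
totals = {"CWE-78": 10, "CWE-22": 8, "CWE-502": 4}
reached = {"CWE-78": 9, "CWE-22": 6, "CWE-502": 2}
exploited = {"CWE-78": 7, "CWE-22": 4}

reached_norm = normalize_per_cwe(reached, totals)
exploited_norm = normalize_per_cwe(exploited, totals)
print(reached_norm)
print(exploited_norm)
```

Normalizing per CWE type keeps a category with few vulnerabilities from being visually dwarfed by a larger one on the same chart.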
Original abstract

Java applications are prone to vulnerabilities stemming from the insecure use of security-sensitive APIs, such as file operations enabling path traversal or deserialization routines allowing remote code execution. These sink APIs encode critical information for vulnerability discovery: the program-specific constraints required to reach them and the exploitation conditions necessary to trigger security flaws. Despite this, existing fuzzers largely overlook such vulnerability-specific knowledge, limiting their effectiveness. We present GONDAR, a sink-centric fuzzing framework that systematically leverages sink API semantics for targeted vulnerability discovery. GONDAR first identifies reachable and exploitable sink call sites through CWE-specific scanning combined with LLM-assisted static filtering. It then deploys two specialized agents that work collaboratively with a coverage-guided fuzzer: an exploration agent generates inputs to reach target call sites by iteratively solving path constraints, while an exploitation agent synthesizes proof-of-concept exploits by reasoning about and satisfying vulnerability-triggering conditions. The agents and fuzzer continuously exchange seeds and runtime feedback, complementing each other. We evaluated GONDAR on real-world Java benchmarks, where it discovers four times more vulnerabilities than Jazzer, the state-of-the-art Java fuzzer. Notably, an earlier GONDAR version contributed to Team Atlanta's first-place CRS in the DARPA AI Cyber Challenge, and is integrated into OSS-CRS, a sandbox project in The Linux Foundation's OpenSSF, to analyze open-source Java projects, where it has already uncovered a zero-day vulnerability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents GONDAR, a sink-centric fuzzing framework for Java vulnerability discovery. It first identifies reachable and exploitable sink call sites via CWE-specific scanning plus LLM-assisted static filtering, then deploys an exploration agent (to reach targets by solving path constraints) and an exploitation agent (to synthesize PoC exploits by satisfying vulnerability conditions) that collaborate with a coverage-guided fuzzer through seed and feedback exchange. On real-world Java benchmarks the framework is reported to discover four times more vulnerabilities than Jazzer; an earlier version contributed to a first-place DARPA AI Cyber Challenge result and has been integrated into OSS-CRS where it found a zero-day.

Significance. If the headline performance claims are substantiated with proper controls and metrics, the work would demonstrate that explicit sink-semantic knowledge can materially improve directed fuzzing effectiveness over general-purpose tools, with immediate practical value shown by the DARPA placement and zero-day discovery. The agent-based decomposition of reachability and exploit synthesis is a concrete engineering contribution that could be adopted or extended by other fuzzing pipelines.

major comments (2)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the central claim of a 4× improvement over Jazzer is presented without any description of benchmark selection criteria, vulnerability counting methodology, statistical significance testing, or comparison controls. Because the headline result is load-bearing for the paper’s contribution, these omissions prevent verification that the gains are attributable to sink-centric targeting rather than other factors.
  2. [Approach / Sink Identification] Sink-identification pipeline (described in the approach section): no precision, recall, or false-positive rate is reported for the CWE-specific scanning combined with LLM-assisted static filtering. This step is the prerequisite that supplies targets to both agents; without quantified reliability on the same benchmarks used for the 4× claim, the attribution of improved vulnerability discovery to “contextualizing sink knowledge” remains unverified.
minor comments (2)
  1. [Approach] The interaction protocol between the exploration agent, exploitation agent, and the underlying fuzzer is described at a high level; a diagram or pseudocode listing the seed-exchange and feedback loop would improve clarity.
  2. [Abstract / Evaluation] The abstract states that GONDAR “has already uncovered a zero-day,” but the main text does not appear to provide a case study or CVE reference for that finding; adding a brief description would strengthen the practical-impact claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and constructive suggestions. Below we provide point-by-point responses to the major comments. We will revise the manuscript to incorporate additional details and evaluations as outlined.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the central claim of a 4× improvement over Jazzer is presented without any description of benchmark selection criteria, vulnerability counting methodology, statistical significance testing, or comparison controls. Because the headline result is load-bearing for the paper’s contribution, these omissions prevent verification that the gains are attributable to sink-centric targeting rather than other factors.

    Authors: We acknowledge that the abstract is concise and omits these methodological details. The Evaluation section describes the benchmarks as real-world Java applications and reports the number of vulnerabilities discovered, but we agree that explicit criteria and methodology would strengthen the claims. In the revised manuscript, we will expand the Evaluation section with a 'Benchmark and Evaluation Methodology' subsection that details: (1) the benchmark selection criteria (e.g., open-source projects with known security issues, ranging from 10k to 100k LOC); (2) the vulnerability counting rule, namely the number of unique, reproducible security violations (confirmed via stack traces or CVE matching) found within a 24-hour budget per tool; (3) a note that statistical significance testing was not performed because repeated experiments are costly, although results were consistent across runs; and (4) additional controls, including runs of Jazzer augmented with our identified sinks. This will better substantiate that the improvements stem from the sink-centric approach. revision: yes

  2. Referee: [Approach / Sink Identification] Sink-identification pipeline (described in the approach section): no precision, recall, or false-positive rate is reported for the CWE-specific scanning combined with LLM-assisted static filtering. This step is the prerequisite that supplies targets to both agents; without quantified reliability on the same benchmarks used for the 4× claim, the attribution of improved vulnerability discovery to “contextualizing sink knowledge” remains unverified.

    Authors: We agree that empirical validation of the sink identification pipeline is crucial. The Approach section details the CWE-specific scanning rules and the LLM-assisted filtering process, including the prompts used. However, we did not report precision and recall because establishing ground truth for sink reachability and exploitability across the entire benchmark suite would require extensive manual effort. For the revision, we will add an 'Accuracy of Sink Identification' subsection in the Evaluation, where we select a random sample of 100 methods from the benchmarks, manually determine reachable and exploitable sinks, and compute precision, recall, and F1 scores for the CWE scanner alone and the combined pipeline. This will provide the necessary quantification and support the attribution of performance gains to our sink knowledge contextualization. revision: yes
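
The metrics named in this response can be computed from a manually labelled sample as sketched below. The sink identifiers are hypothetical and the numbers are not from the paper; the sketch only shows the standard definitions of precision, recall, and F1 over sets of flagged versus ground-truth sinks.

```python
def precision_recall_f1(predicted: set, actual: set):
    """Standard set-based metrics: predicted = flagged sinks, actual = ground truth."""
    tp = len(predicted & actual)   # flagged and truly exploitable
    fp = len(predicted - actual)   # flagged but not exploitable
    fn = len(actual - predicted)   # exploitable but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical labelled sample.
flagged_by_pipeline = {"s1", "s2", "s3", "s4"}
manually_confirmed = {"s1", "s2", "s5"}

p, r, f1 = precision_recall_f1(flagged_by_pipeline, manually_confirmed)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Running the same function once on the CWE scanner's output and once on the combined pipeline's output would give the side-by-side comparison the response proposes.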

Circularity Check

0 steps flagged

No circularity: engineering framework evaluated against external baseline

full rationale

The paper describes GONDAR as a practical fuzzing pipeline that identifies sinks via CWE scanning plus LLM-assisted static filtering, then deploys exploration and exploitation agents alongside a coverage-guided fuzzer. No equations, fitted parameters, or self-referential derivations appear in the provided text. The headline performance claim (4x vulnerabilities vs. Jazzer plus a zero-day) is measured directly against an external tool on real-world benchmarks, making the result falsifiable outside any internal construction. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The pipeline steps are presented as implementation choices whose effectiveness is assessed empirically rather than by definition or renaming of prior results. This satisfies the criteria for a self-contained artifact with no detectable circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that LLM static filtering produces high-quality sink targets and that the agents can solve path constraints and vulnerability conditions in practice; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: LLM-assisted static filtering can accurately distinguish reachable and exploitable sink call sites from false positives
    Invoked in the first step of GONDAR to select targets for the agents
invented entities (2)
  • Exploration agent (no independent evidence)
    purpose: Iteratively solve path constraints to reach target sink call sites
    New component introduced to complement the coverage-guided fuzzer
  • Exploitation agent (no independent evidence)
    purpose: Synthesize proof-of-concept exploits by satisfying vulnerability-triggering conditions
    New component introduced to complement the coverage-guided fuzzer

pith-pipeline@v0.9.0 · 5564 in / 1497 out tokens · 34109 ms · 2026-05-13T21:45:38.759653+00:00 · methodology

