arxiv: 2605.21824 · v1 · pith:2RQFTWFYnew · submitted 2026-05-20 · 💻 cs.CR · cs.SE

Quality-Assured Fuzz Harness Generation via the Four Principles Framework

Pith reviewed 2026-05-22 08:14 UTC · model grok-4.3

classification 💻 cs.CR cs.SE
keywords fuzz testing · harness generation · LLM agents · software security · memory safety · bug detection · automated testing · quality assurance

The pith

The Four Principles framework gives the first source-level definition of correct fuzz harnesses with enforceable checks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that defines what makes a fuzz harness correct at the source-code level, before any fuzzing runs. It specifies four principles, each with a mathematical specification and implementable checks: Logic Correctness, API Protocol Compliance, Security Boundary Respect, and Entry Point Adequacy. An autonomous LLM agent applies them in a generate-check-fix loop so that the harnesses it produces do not introduce errors of their own. On 23 projects the system submitted 42 bug reports, of which 29 were fixed or confirmed upstream (including three CVEs) and 2 were rejected, a 4.8 percent false-positive rate; during generation its P1/P2 checks intercepted 58 harness-induced crashes that would otherwise have been false positives. Readers care because reliable harnesses let fuzzing campaigns focus on real memory-safety problems instead of chasing harness mistakes.

Core claim

The Four Principles framework supplies the first source-level definition of harness correctness with mathematical specifications and implementable checks. An autonomous LLM agent produces harnesses satisfying the four principles via a generate-check-fix loop before fuzzing begins, intercepting harness-induced crashes and yielding confirmed bugs with low false-positive rates across C/C++, Java, and JavaScript projects.

What carries the argument

The Four Principles framework (P1 Logic Correctness, P2 API Protocol Compliance, P3 Security Boundary Respect, P4 Entry Point Adequacy) with its mathematical specifications and implementable source-level checks, embedded in a generate-check-fix loop inside an autonomous LLM agent.
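The generate-check-fix loop described above can be sketched roughly as follows. This is a minimal illustration, not QuartetFuzz's implementation: the names `generate_check_fix`, `LoopResult`, and `max_rounds`, the string-based harness, and the toy P1/P2 checks are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical types: a harness is plain source text, and a check returns a
# list of violation messages (an empty list means the check passed).
Check = Callable[[str], List[str]]


@dataclass
class LoopResult:
    harness: Optional[str]             # accepted harness, or None if the budget ran out
    rounds: int                        # number of check rounds performed
    violations: Dict[str, List[str]]   # violations found in the last round


def generate_check_fix(
    generate: Callable[[], str],
    fix: Callable[[str, Dict[str, List[str]]], str],
    checks: Dict[str, Check],
    max_rounds: int = 5,
) -> LoopResult:
    """Check the harness and repair it until all checks pass or the budget ends."""
    harness = generate()
    violations: Dict[str, List[str]] = {}
    for round_no in range(1, max_rounds + 1):
        found = {name: check(harness) for name, check in checks.items()}
        violations = {name: msgs for name, msgs in found.items() if msgs}
        if not violations:
            # Every principle check passed: only now is the harness released.
            return LoopResult(harness, round_no, {})
        harness = fix(harness, violations)
    # Budget exhausted: reject the harness rather than fuzz with it.
    return LoopResult(None, max_rounds, violations)


# Toy stand-ins: the P2 check demands a cleanup call in the harness text.
toy_checks = {
    "P1": lambda src: [],
    "P2": lambda src: [] if "free(" in src else ["missing cleanup call"],
}
result = generate_check_fix(
    generate=lambda: "ctx = create(); parse(ctx, data);",
    fix=lambda src, found: src + " free(ctx);",
    checks=toy_checks,
)
print(result.rounds, result.harness)
```

The design point the sketch tries to capture is that a harness which never passes is dropped instead of being handed to the fuzzer, which is how harness-induced crashes would be stopped before fuzzing begins.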

If this is right

  • Harnesses can be validated for correctness at the source level before any fuzz campaign starts.
  • LLM-driven generation can scale while keeping false-positive crashes low through built-in checks.
  • Existing production harnesses can be audited for violations and repaired systematically.
  • Fuzzing results become more trustworthy because harness errors are removed upstream.
  • The approach extends quality assurance to harnesses for multiple languages including C/C++, Java, and JavaScript.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same principle-based checks could be adapted to generate or audit test drivers outside fuzzing, such as for API testing or integration tests.
  • Integrating the checks into developer tools might reduce harness errors even when humans write them manually.
  • Common violation patterns identified by the checks could guide better API documentation or library design to prevent harness mistakes.
  • The framework's emphasis on security boundaries might generalize to other safety properties in automated test generation.

Load-bearing premise

The four principles together cover every correctness property that matters for a fuzz harness, so passing the automated checks is enough to ensure the harness will not create false positives or hide real bugs during later fuzzing.

What would settle it

A harness that satisfies all four principle checks yet still triggers crashes unrelated to the target code, or a harness that violates one principle yet produces only true-positive findings when used in fuzzing.
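The two settling cases can be written as a small predicate over triage results. The sketch below is illustrative only: the function name `is_counterexample` and the crash-origin labels `"harness"` and `"target"` are assumptions, standing in for whatever manual triage would assign.

```python
from typing import Iterable


def is_counterexample(passes_all_checks: bool, crash_origins: Iterable[str]) -> bool:
    """Return True if a harness's fuzzing outcome contradicts the premise.

    crash_origins holds one hypothetical triage label per observed crash:
    "harness" (caused by the harness itself) or "target" (a real target bug).
    """
    origins = list(crash_origins)
    if passes_all_checks:
        # Passed every check, yet at least one crash is the harness's fault.
        return any(origin == "harness" for origin in origins)
    # Violated a principle, yet produced crashes that were all real bugs.
    return bool(origins) and all(origin == "target" for origin in origins)


print(is_counterexample(True, ["target", "harness"]))   # True
print(is_counterexample(False, ["target"]))             # True
print(is_counterexample(True, ["target"]))              # False
```

The first branch would challenge the claim that passing the checks is sufficient; the second would challenge the claim that each check is necessary.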

Figures

Figures reproduced from arXiv: 2605.21824 by Dmitrijs Trizna, Jeff Huang, Luigino Camastra, Qingxiao Xu, Ze Sheng, Zhicheng Chen.

Figure 1: Developer responses when contacted about AI
Figure 4: An example Logic Group generated by QuartetFuzz
Figure 3: System overview. Stage 1: the Logic Group agent explores the project, identifies candidate functionalities, and ranks
Figure 5: Condensed P2 protocol report for ICU’s createFromRules API, generated by the API Research agent. Each entry carries a project-specific claim grounded in source-cited evidence (file:line). Sub-check definitions (P2.1–P2.8) are tabulated in
Figure 6: Example prompt pair for the same case
Figure 7: Four Principles under a prompt-only Skill-style
Figure 9: Distribution of the 40 real bugs (2 FPs excluded):
Figure 10: Input/output of the six non-trivial SAST tools.
Figure 11: Representative P1 and P2 violations identified and repaired by the agent. Top: harfbuzz
Figure 12: Representative RQ3 coverage win and loss vs. gold on shared entries. Left: wabt/wasm2wat (
Original abstract

Fuzz testing is the dominant technique for finding memory-safety vulnerabilities in C/C++ software, yet its effectiveness hinges on the quality of fuzz harnesses -- the programs that bridge fuzzers and library APIs. A growing body of tools now automate harness generation, but none systematically ensures the correctness of produced harnesses: logic errors, API misuse, and lifecycle violations go undetected at the source level. As LLM-driven generation scales harness creation, uncontrolled quality turns scale into a liability. We present QuartetFuzz, an autonomous harness-generation system that systematically improves correctness throughout the generation process. At its core is the Four Principles framework -- Logic Correctness (P1), API Protocol Compliance (P2), Security Boundary Respect (P3), and Entry Point Adequacy (P4) -- the first source-level definition of harness correctness with mathematical specifications and implementable checks. We operationalize these principles in an autonomous LLM agent that produces harnesses satisfying P1-P4 through a generate-check-fix loop before any fuzzing begins. Deployed on 23 open-source projects spanning C/C++, Java, and JavaScript, the system submits 42 bug reports, of which 29 are fixed or confirmed upstream (including 3 CVEs) and only 2 are rejected (4.8% FP rate). During generation, the built-in P1/P2 checks automatically intercepted 58 harness-induced crashes that would otherwise have been false positives. Applied as a quality auditor to 586 existing production harnesses across 70 projects, the system identifies 53 violations (45 confirmed, 35 fixed). We release a dataset of 100 labeled harnesses for reproducible evaluation. Code and dataset are available at https://github.com/OwenSanzas/QuartetFuzz

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents QuartetFuzz, an autonomous LLM agent for generating fuzz harnesses that satisfy the Four Principles framework (Logic Correctness P1, API Protocol Compliance P2, Security Boundary Respect P3, and Entry Point Adequacy P4). These principles are claimed to provide the first source-level mathematical definition of harness correctness with implementable checks. The system uses a generate-check-fix loop and is evaluated on 23 projects, yielding 42 bug reports (29 confirmed or fixed, including 3 CVEs) at a 4.8% false-positive rate while intercepting 58 harness-induced crashes; it is also applied as an auditor to 586 existing harnesses across 70 projects, identifying 53 violations. A dataset of 100 labeled harnesses is released.

Significance. If the central claim that satisfying the automated P1-P4 checks suffices to prevent false positives and missed bugs holds, the work would provide a practical advance in automated harness generation for fuzz testing of C/C++, Java, and JavaScript libraries. The empirical results on real upstream projects, the low reported FP rate, and the public release of code and a labeled dataset are concrete strengths that support reproducibility and further research in security testing.

major comments (2)
  1. [Abstract and section defining the Four Principles framework] The central claim that the Four Principles together capture all relevant correctness properties (so that satisfying the checks guarantees no false positives in later fuzzing) lacks a soundness argument, exhaustive enumeration of harness error classes, or explicit treatment of defects outside P1-P4 such as incorrect global-state reset between iterations or mishandled library-internal callbacks. This is load-bearing for the reported 4.8% FP rate and the guarantee against harness-induced crashes.
  2. [Evaluation section (methods and results)] The abstract reports concrete outcomes (42 bug reports, 29 confirmed, 58 intercepted crashes, 4.8% FP rate) but provides no detail on how the P1-P2 checks were implemented, what static or dynamic analyses were used, or how false-positive classification was performed. This directly affects the soundness and reproducibility of the empirical results on the 23 projects.
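To make the first major comment concrete, the toy simulation below shows how a missing global-state reset can produce a crash that depends on iteration history rather than on the input. Everything here is hypothetical: `library_parse`, `library_reset`, the registry limit of 3, and both harnesses are invented for illustration, and the sketch takes no position on whether P1 as specified in the paper already catches this class.

```python
# Hypothetical toy "library": parsing itself has no bug, but the library
# requires callers to reset a shared registry before it fills up.
_registry = []


def library_parse(data: bytes) -> int:
    if len(_registry) >= 3:
        raise RuntimeError("registry full")
    _registry.append(data)
    return len(data)


def library_reset() -> None:
    _registry.clear()


def leaky_harness(data: bytes) -> None:
    # Defect: shared state is never reset between fuzz iterations.
    library_parse(data)


def clean_harness(data: bytes) -> None:
    try:
        library_parse(data)
    finally:
        # Restore global state so every iteration starts from the same point.
        library_reset()


def run_campaign(harness, inputs) -> int:
    """Return the index of the first crashing iteration, or -1 if none crash."""
    library_reset()
    for index, data in enumerate(inputs):
        try:
            harness(data)
        except RuntimeError:
            return index
    return -1


inputs = [b"a"] * 5
print(run_campaign(leaky_harness, inputs))   # 3
print(run_campaign(clean_harness, inputs))   # -1
```

The leaky harness fails on the fourth identical input, so replaying that single input in isolation would not reproduce the crash. That is the signature of a harness-induced false positive.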
minor comments (2)
  1. [Four Principles framework] The notation and mathematical specifications for P1-P4 would benefit from additional concrete code examples or pseudocode to improve clarity for readers implementing the checks.
  2. [Auditing results] A table or figure presenting the breakdown of the 53 violations found in the 586 existing harnesses (e.g., by principle and project) is missing; adding one would strengthen the auditing results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and constructive suggestions. Below we provide point-by-point responses to the major comments and indicate the revisions we will make to the manuscript.

  1. Referee: [Abstract and section defining the Four Principles framework] The central claim that the Four Principles together capture all relevant correctness properties (so that satisfying the checks guarantees no false positives in later fuzzing) lacks a soundness argument, exhaustive enumeration of harness error classes, or explicit treatment of defects outside P1-P4 such as incorrect global-state reset between iterations or mishandled library-internal callbacks. This is load-bearing for the reported 4.8% FP rate and the guarantee against harness-induced crashes.

    Authors: We acknowledge that our presentation of the Four Principles would benefit from a more explicit discussion of their completeness and scope. The principles were developed by categorizing common harness defects reported in the literature and observed in practice across numerous fuzzing campaigns. P1 (Logic Correctness) addresses issues like incorrect state management including global resets through its mathematical specification of harness logic. P3 (Security Boundary Respect) covers boundary issues that could relate to callbacks. However, we agree that an exhaustive enumeration is challenging and a formal soundness proof would be ideal but is left for future work. In the revision, we will add a subsection in the framework definition that discusses potential defects outside the current checks and how they might manifest, while maintaining that the low empirical FP rate supports the practical utility of the checks. revision: partial

  2. Referee: [Evaluation section (methods and results)] The abstract reports concrete outcomes (42 bug reports, 29 confirmed, 58 intercepted crashes, 4.8% FP rate) but provides no detail on how the P1-P2 checks were implemented, what static or dynamic analyses were used, or how false-positive classification was performed. This directly affects the soundness and reproducibility of the empirical results on the 23 projects.

    Authors: We apologize for the insufficient detail in the current draft. The implementation of the checks involves a combination of static analysis using tools like Clang for C/C++ to verify API calls and logic, and dynamic execution in a sandboxed environment to detect crashes during the check phase. For false-positive classification, we followed a standard process of reporting to upstream maintainers and classifying based on their responses (confirmed, fixed, rejected). We will revise the Evaluation section to include detailed descriptions, possibly with figures or algorithms, of the check implementations and the classification criteria to ensure reproducibility. revision: yes
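The classification process described in this response can be sketched as a tally over maintainer responses. This is only an illustration under stated assumptions: the function name `summarize` and the label strings are invented, the 29 fixed-or-confirmed reports are lumped under one label because the source does not split them, and the remaining 11 of the 42 reports get a placeholder `"pending"` label because their status is not stated.

```python
from collections import Counter


def summarize(responses):
    """Tally upstream responses and compute the false-positive rate.

    The FP rate is taken as rejected reports over all submitted reports,
    which is the reading consistent with the reported 2 of 42 (4.8%).
    """
    counts = Counter(responses)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reports to summarize")
    return {
        "submitted": total,
        "true_positive": counts["fixed"] + counts["confirmed"],
        "rejected": counts["rejected"],
        "fp_rate_pct": round(100 * counts["rejected"] / total, 1),
    }


# Counts from the abstract; the "pending" label is an assumption.
responses = ["confirmed"] * 29 + ["rejected"] * 2 + ["pending"] * 11
summary = summarize(responses)
print(summary)
```

Under this reading, reports that maintainers have not yet answered count toward the denominator but not toward either outcome, so the stated rate could still move as more responses arrive.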

Circularity Check

0 steps flagged

Empirical system with external validation shows no significant circularity


The paper describes an empirical harness-generation system and its evaluation on 23 external open-source projects plus 586 existing harnesses. Central results consist of counts of intercepted crashes, confirmed upstream bugs, and fixed violations, all measured against independent upstream project responses rather than internal fitted parameters or self-referential equations. The Four Principles are introduced as a new definitional framework with implementable checks, but the reported outcomes (29 confirmed bugs, 4.8% FP rate) do not reduce to those definitions by construction. No load-bearing self-citation chains, uniqueness theorems, or ansatzes appear in the provided description; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the four named principles constitute a complete and checkable definition of harness correctness; no free parameters or invented physical entities are introduced, but the principles themselves function as domain assumptions whose coverage is not independently verified in the abstract.

axioms (1)
  • domain assumption: The Four Principles (P1 Logic Correctness, P2 API Protocol Compliance, P3 Security Boundary Respect, P4 Entry Point Adequacy) together define all necessary correctness properties for a fuzz harness.
    Invoked in the abstract as the core of the framework, with 'mathematical specifications and implementable checks'.

pith-pipeline@v0.9.0 · 5867 in / 1528 out tokens · 30830 ms · 2026-05-22T08:14:36.833692+00:00 · methodology


