FuzzPilot: Plateau-Triggered Recipe Validation for Structured Text Fuzzing

Zhiyi Yao (Qingdao University of Technology)

arXiv: 2605.26539 · v1 · pith:HZ3FKIMZnew · submitted 2026-05-25 · 💻 cs.SE · cs.CR

Pith reviewed 2026-06-29 20:36 UTC · model grok-4.3

classification 💻 cs.SE cs.CR

keywords: fuzzing, AFL++, mutation recipes, plateau detection, recipe validation, language model, cJSON, structured text

The pith

FuzzPilot tests candidate mutation recipes in short, isolated AFL++ campaigns only when coverage plateaus; on cJSON it promoted none of twenty model-proposed recipes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FuzzPilot as a controller layered on AFL++ that detects coverage plateaus, snapshots the corpus, and runs candidate recipes through brief validation micro-campaigns before deciding whether to adopt them. Recipes are stored as JSON structures of operator weights, byte ranges, and dictionary tokens rather than generated code. The narrow evaluation on the already-saturated cJSON target shows that the system maintains execution throughput near baseline levels while producing a shorter median plateau time, but the difference is not statistically significant. No recipes were promoted because all returned zero reward, and the authors attribute any plateau improvement to the snapshot-restart mechanism instead of the model or mutator.
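Because recipes are data rather than code, a proposal can be checked mechanically before it reaches the fuzzer. The following is a minimal sketch of such a check, not the paper's schema: the field names `operator_weights`, `byte_ranges`, and `dictionary_tokens` and the specific rules (non-negative weights, half-open increasing byte ranges) are assumptions chosen for illustration.

```python
import json

# Field names below are hypothetical; the paper only says that recipes carry
# operator weights, byte ranges, and dictionary tokens.
EXAMPLE_RECIPE = """
{
  "operator_weights": {"bitflip": 0.2, "insert_token": 0.5, "delete_span": 0.3},
  "byte_ranges": [[0, 64]],
  "dictionary_tokens": ["{", "}", "null", "true"]
}
"""


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_recipe(text):
    """Parse a recipe and return it, raising ValueError if it is malformed."""
    try:
        recipe = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(recipe, dict):
        raise ValueError("recipe must be a JSON object")

    weights = recipe.get("operator_weights")
    if not isinstance(weights, dict) or not weights:
        raise ValueError("operator_weights must be a non-empty object")
    if not all(_is_number(w) and w >= 0 for w in weights.values()):
        raise ValueError("operator weights must be non-negative numbers")
    if sum(weights.values()) <= 0:
        raise ValueError("at least one operator weight must be positive")

    ranges = recipe.get("byte_ranges", [])
    if not isinstance(ranges, list):
        raise ValueError("byte_ranges must be a list")
    for rng in ranges:
        well_formed = (
            isinstance(rng, list)
            and len(rng) == 2
            and all(isinstance(x, int) and not isinstance(x, bool) for x in rng)
            and 0 <= rng[0] < rng[1]
        )
        if not well_formed:
            raise ValueError(f"bad byte range: {rng!r}")

    tokens = recipe.get("dictionary_tokens", [])
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        raise ValueError("dictionary_tokens must be a list of strings")
    return recipe


def is_valid(text):
    """Boolean wrapper, convenient when filtering a batch of proposals."""
    try:
        validate_recipe(text)
        return True
    except ValueError:
        return False
```

A data-only format keeps this check cheap and total: a malformed proposal is rejected by a parser, with no need to sandbox or compile anything the model wrote.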

Core claim

FuzzPilot snapshots the corpus upon plateau detection, prepares candidate recipes from local rules or a language-model agent supplied with Ghidra-derived constants, evaluates each recipe inside a short isolated AFL++ micro-campaign, and promotes only those yielding positive validation reward. In five 14,400-second repetitions against vanilla AFL++ on cJSON, the validation gate examined twenty model-proposed recipes and promoted none; the observed median plateau reduction from 2,532 s to 1,384 s is therefore attributed to the controller's snapshot and restart machinery rather than to the recipes themselves.
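The promotion rule described here can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: `Candidate`, `validation_gate`, and `new_edge_reward` are hypothetical names, the micro-campaign is abstracted as a caller-supplied function, and the reward (edges gained over the snapshot) is an assumed definition, since this page does not give the paper's reward function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Candidate:
    name: str
    recipe: Dict = field(default_factory=dict)  # parsed JSON recipe


def new_edge_reward(edges_before: int, edges_after: int) -> int:
    # Assumed reward: edges gained during the micro-campaign, never negative.
    return max(0, edges_after - edges_before)


def validation_gate(
    candidates: List[Candidate],
    run_micro_campaign: Callable[[Candidate], float],
) -> List[Candidate]:
    """Return only the candidates whose validation reward is strictly positive.

    run_micro_campaign stands in for launching a short, isolated AFL++
    campaign from the corpus snapshot and scoring its outcome.
    """
    promoted = []
    for candidate in candidates:
        reward = run_micro_campaign(candidate)
        if reward > 0:
            promoted.append(candidate)
    return promoted
```

Under this assumed reward, a target already at its ceiling cannot promote anything: if the snapshot holds 269 edges and every micro-campaign also ends at 269, each reward is zero and the gate returns an empty list. That is one way the reported zero-promotion outcome could arise; the page does not say it is the only one.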

What carries the argument

The plateau-triggered validation gate that launches short isolated AFL++ micro-campaigns to compute reward for JSON-encoded candidate recipes before promotion.

If this is right

Throughput stays comparable to vanilla AFL++ with median execs-per-second at roughly 1.06 times baseline.
Median plateau duration shortens but the change is not statistically significant at N=5.
No model-proposed recipes receive positive reward on the saturated cJSON target.
The architecture separates expensive reasoning steps from the mutation hot path without throughput loss in the reported setting.
The cJSON results serve as an auditable baseline for later tests on programs that have not reached coverage saturation.
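To see why N=5 per arm leaves little room for significance, it helps to compute the Mann-Whitney test the paper reports by exhaustive enumeration. The sketch below is a generic exact two-sided test written for illustration; the function names are hypothetical and the sample values are made up, because the per-run plateau times are not reproduced on this page, so it does not recompute the reported p=0.42.

```python
from itertools import combinations
from math import comb


def u_statistic(x, y):
    """Mann-Whitney U for x: count pairs with x_i > y_j, ties count one half."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)


def exact_two_sided_p(x, y):
    """Exact p-value: share of group relabelings at least as extreme as observed."""
    pooled = list(x) + list(y)
    n, m = len(x), len(y)
    center = n * m / 2  # expected U under the null hypothesis
    observed = abs(u_statistic(x, y) - center)
    hits = 0
    for idx in combinations(range(n + m), n):
        chosen = set(idx)
        gx = [pooled[i] for i in chosen]
        gy = [pooled[i] for i in range(n + m) if i not in chosen]
        # Small tolerance guards against float noise from half-counted ties.
        if abs(u_statistic(gx, gy) - center) >= observed - 1e-12:
            hits += 1
    return hits / comb(n + m, n)


# Hypothetical, perfectly separated samples: the most extreme outcome possible.
best_case = exact_two_sided_p([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

With five runs per arm there are only C(10, 5) = 252 relabelings, so even perfectly separated arms give `best_case` = 2/252, roughly 0.008. Any overlap between the arms pushes the p-value up quickly, which is consistent with a large descriptive difference in medians failing to reach significance.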

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Snapshot and restart machinery alone may be sufficient to produce the reported reduction in plateau length.
Validation signals collected on saturated targets are unlikely to predict behavior on programs still capable of substantial coverage growth.
Model proposals may require different reward definitions or richer target context before any recipe earns promotion.

Load-bearing premise

Short isolated micro-campaigns on a saturated target like cJSON can provide a reliable signal for whether recipe validation would improve coverage on non-saturated, more complex programs.

What would settle it

A run on a non-saturated target in which at least one model-proposed recipe receives positive reward, is promoted, and produces higher final edge coverage than a snapshot-restart-only control arm.

Figures

Figures reproduced from arXiv: 2605.26539 by Zhiyi Yao (Qingdao University of Technology).

**Figure 2.** Recipe lifecycle. Each LLM proposal is first schema-validated, then evaluated in an …

**Figure 3.** Edge-discovery trajectory on cJSON for baseline-afl (E1a, N=5) and full-agent (E1b, N=5), zoomed into the 264–270 edge band after the initial ramp. Both modes converge to the same 269-edge ceiling; the difference is timing, not final coverage. The dashed verticals mark the per-mode median last find (11,911 s vs 13,059 s), showing that full-agent’s productive phase extends closer to the budget end.

**Figure 4.** Cycles done per run across the two modes (N=5 each). Solid horizontal bars are per-mode medians. full-agent completes ≈ 5.7× fewer corpus cycles in the same 14,400 s budget, despite an equal-or-higher total execution count ( …

**Figure 5.** Per-run plateau-onset timeline for E1a (top, blue) and E1b (bottom, orange). Solid bars …

**Figure 6.** End-to-end execs per sec per run for both modes (N=5 each); horizontal bars are per-mode medians. Even the slowest full-agent rep (13,540) exceeds every baseline-afl rep except r05 (13,854, also a solo-occupancy run). The median ratio 1.059× satisfies the 0.85 acceptance gate; we do not interpret the point estimate as a speedup claim at N=5.

**Figure 7.** Mutator-only throughput by configuration (mean …

Original abstract

FuzzPilot is a controller for AFL++ that moves expensive reasoning out of the mutation hot path. When coverage plateaus, it snapshots the corpus, prepares candidate mutation recipes, evaluates them in short isolated AFL++ micro-campaigns, and promotes only recipes with positive validation reward. Recipes are JSON data, not generated code: a native custom mutator consumes operator weights, byte ranges, corpus-selection rules, and dictionary tokens. Candidate recipes can come from local rules or from a language-model agent, with Ghidra-derived constants and decompiled context as target hints. This preprint reports a deliberately narrow cJSON evaluation. We compare vanilla AFL++ and the full FuzzPilot agent over five 14,400 s repetitions per arm. cJSON is saturated: baseline AFL++ reaches the exposed 269-edge ceiling at a median of about 2,500 s. The experiments therefore do not show that language-model proposals improve coverage or generalize beyond cJSON. Within this scope, FuzzPilot preserves throughput (median execs_per_sec about 1.06x baseline), shows a descriptively shorter median plateau (1,384 s versus 2,532 s), but the difference is not statistically significant at N=5 (Mann-Whitney p=0.42). The validation gate evaluated 20 model-proposed recipes and promoted none because all rewards were zero. The observed plateau reduction is more likely due to controller snapshot and restart machinery than to the model or recipe mutator. This version is best read as an auditable implementation report and baseline for ongoing non-saturated-target evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a transparent negative-result implementation report on LM recipe validation for AFL++ that found zero promotions and non-significant differences on a saturated cJSON target.

This paper is mainly an implementation report of a negative result: their plateau-triggered recipe validation promoted none of the 20 LM-proposed recipes on cJSON, and the shorter plateau is more likely from the controller's snapshot/restart than from the model.

FuzzPilot moves reasoning out of the mutation loop by validating JSON recipes in short isolated campaigns when coverage plateaus. Recipes encode operator weights, byte ranges, corpus-selection rules, and dictionary tokens, and can be proposed by an LM agent working from Ghidra-derived hints. The native mutator then consumes whichever recipes are promoted.

It handles the evaluation transparently, with the abstract laying out the non-significant p-value, zero promotions, and the saturated target caveat. That honesty is the main strength here.

The new part is the specific recipe validation gate and the JSON format for mutator parameters, though the core idea of adaptive controllers isn't novel.

The limitation is obvious from the design: testing only on a target that saturates fast means we learn little about whether this would help real-world fuzzing where coverage keeps growing. With just five runs, the stats are thin too.

This is for fuzzing researchers who want to see a concrete attempt at LM integration and its pitfalls. It serves as a baseline for future work on non-saturated targets.

I wouldn't push for peer review; it's better as an arXiv tech note.

Referee Report

0 major / 2 minor

Summary. The manuscript presents FuzzPilot, an AFL++ controller that moves expensive reasoning (local rules or LLM-generated mutation recipes) out of the hot path by triggering only on coverage plateaus. It snapshots the corpus, evaluates candidate JSON recipes (operator weights, byte ranges, corpus rules, dictionary tokens) in short isolated micro-campaigns, and promotes only those with positive validation reward. On the saturated cJSON target (269-edge ceiling reached by baseline at median ~2,500 s), five 14,400 s repetitions per arm show preserved throughput (~1.06x execs/sec), a descriptively shorter median plateau (1,384 s vs 2,532 s), but no statistical significance (Mann-Whitney p=0.42 at N=5). The validation gate evaluated 20 model-proposed recipes and promoted none (all rewards zero); the paper attributes any plateau difference to snapshot/restart machinery rather than the model or mutator, and explicitly frames the work as a narrow-scope implementation report and baseline for future non-saturated targets.

Significance. The transparent reporting of a negative result, zero promotions, and explicit attribution to restart machinery (rather than overclaiming LLM benefits) provides a useful, auditable baseline for structured-text fuzzing research. The work states its experimental design and statistical caveats directly, which strengthens its value as a reference point for subsequent evaluations on more complex programs.

minor comments (2)

[Abstract] Abstract and evaluation section: the reward function used for the zero-promotion outcome is referenced but not defined in the provided abstract; a one-sentence inline definition or pointer to its equation would make the negative result fully self-contained without requiring the reader to consult the full methods.
[Discussion] The manuscript notes the narrow cJSON scope and lack of generalization; a brief sentence in the discussion contrasting this saturated target with expected behavior on non-saturated programs would further clarify the intended scope of the baseline.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, the recognition of our transparent reporting of negative results and zero promotions, and the recommendation to accept. The manuscript is intentionally scoped as an implementation report and baseline on a saturated target.

Circularity Check

0 steps flagged

No significant circularity identified

The paper is an experimental implementation report comparing AFL++ variants on a saturated cJSON target. It contains no mathematical derivations, equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations. All claims rest on direct measurements (execs/sec, plateau times, reward values), statistical tests (Mann-Whitney), and explicit negative findings (zero promotions). The attribution of plateau reduction to snapshot/restart machinery follows directly from the zero-reward outcome and experimental design, without reduction to prior self-citations or ansatzes. This matches the default expectation of a self-contained empirical study against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper introduces a controller architecture but relies on standard AFL++ behavior and domain assumptions about coverage measurement and micro-campaign evaluation; no free parameters, new entities, or ad-hoc axioms are introduced beyond those implicit in fuzzing practice.

axioms (2)

domain assumption: Coverage plateau detection in AFL++ provides a meaningful trigger point for recipe validation.
The system design depends on identifying when coverage stops improving to initiate snapshot and validation.
domain assumption: Short micro-campaigns yield a valid reward signal for recipe quality.
The validation step uses isolated short AFL++ runs to decide whether to promote a recipe.
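
The first assumption can be made concrete with a simple detector. This is a sketch of one plausible rule, not the paper's: it declares a plateau once no new edge has appeared for a fixed window, and both the function name `plateau_onset` and the window length are assumptions, since the paper's actual trigger condition is not given on this page.

```python
def plateau_onset(samples, window_s):
    """Return the first timestamp at which no new edge has been seen for
    window_s seconds, or None if that never happens.

    samples: iterable of (timestamp_s, edges_found) pairs, sorted by time.
    """
    best_edges = -1
    last_growth_t = None
    for t, edges in samples:
        if edges > best_edges:
            # Coverage grew: remember when, and restart the stall clock.
            best_edges = edges
            last_growth_t = t
        elif t - last_growth_t >= window_s:
            return t
    return None


# Hypothetical coverage log: growth stops at t=120 s with 269 edges.
log = [(0, 100), (60, 200), (120, 269), (180, 269), (420, 269)]
onset = plateau_onset(log, window_s=300)
```

The window is the sensitive choice here: too short and ordinary pauses between finds trigger validation, too long and the controller idles on a stalled campaign.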

Reference graph

Works this paper leans on

29 extracted references · 10 canonical work pages

[1] Anonymous. MALF: A multi-agent LLM framework for intelligent fuzzing of industrial control protocols. arXiv preprint arXiv:2510.02694, 2025.

[2] Anonymous. Semantic-aware fuzzing: An empirical framework for llm-guided, reasoning-driven input mutation. arXiv preprint arXiv:2509.19533, 2025.

[3] Andrea Arcuri and Lionel Briand. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In Proceedings of the 33rd International Conference on Software Engineering (ICSE ’11), pages 1–10, 2011.

[4] Cornelius Aschermann, Tommaso Frassetto, Thorsten Holz, Patrick Jauernig, Ahmad-Reza Sadeghi, and Daniel Teuchert. NAUTILUS: Fishing for deep bugs with grammars. In Proceedings of the Network and Distributed System Security Symposium (NDSS ’19), 2019.

[5] Cornelius Aschermann, Sergej Schumilo, Tim Blazytko, Robert Gawlik, and Thorsten Holz. REDQUEEN: Fuzzing with input-to-state correspondence. In Proceedings of the Network and Distributed System Security Symposium (NDSS ’19), 2019.

[6] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’23), pages 423–435, 2023.

[7] Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. Large language models are edge-case generators: Crafting unusual programs for fuzzing deep learning libraries. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE ’24), 2024.

[8] Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. AFL++: Combining incremental steps of fuzzing research. USENIX Workshop on Offensive Technologies (WOOT),

[9] Software v4.21c used in this paper

[10] Jie Hu, Qian Zhang, and Heng Yin. ChatFuzz: Augmenting greybox fuzzing with large language models. arXiv preprint arXiv:2308.11525, 2023.

[11] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. Evaluating fuzz testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS ’18), pages 2123–2138, 2018.

[12] laf-intel. Circumventing fuzzing roadblocks with compiler transformations. https://lafintel.wordpress.com, 2016. Blog post, August 15, 2016. The laf-intel LLVM pass; integrated into AFL++ as AFL_LLVM_LAF_* build flags.

[13] Shiyin Lin. Hybrid fuzzing with LLM-guided input mutation and semantic feedback. arXiv preprint arXiv:2511.03995, 2025.

[14] Dongge Liu, Jonathan Metzman, Oliver Chang, and Google OSS-Fuzz team. OSS-Fuzz-Gen: An open framework for LLM-driven fuzz target generation. arXiv preprint arXiv:2404.14924, 2024.

[15] K. Liu et al. Low-cost and comprehensive non-textual input fuzzing with LLM-synthesized input generators. In Proceedings of the 34th USENIX Security Symposium (USENIX Security ’25), 2025. Also arXiv:2501.19282.

[16] Liqun Liu et al. FuzzCoder: Byte-level fuzzing test via large language model. arXiv preprint arXiv:2409.01944, 2024.

[17] Conghua Ou et al. The Mutators reloaded: Fuzzing compilers with large language model generated mutators. In Proceedings of the 29th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’24), 2024.

[18] Van-Thuan Pham, Marcel Böhme, Andrew E. Santosa, Alexandru Răzvan Caciulescu, and Abhik Roychoudhury. Smart greybox fuzzing. In IEEE Transactions on Software Engineering (TSE), 2021. Originally NDSS 2019 (AFLSmart); journal extension 2021.

[19] Donald J. Schuirmann. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6):657–680, 1987.

[20] Yuyan Sun et al. Sphinx: A language model guided generator for solver fuzzing. In Proceedings of the International Conference on Automated Software Engineering (ASE ’24), 2024. Representative “LLM upfront, no steady-state calls” design for SMT-solver fuzzing.

[21] András Vargha and Harold D. Delaney. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2):101–132, 2000.

[22] Hongxiang Wang, Xiangwei Xu, Xiaofei Xie, et al. LLAMAFUZZ: Large language model enhanced greybox fuzzing. arXiv preprint arXiv:2406.07714, 2024. Latest revision March 2026.

[23] Junjie Wang, Bihuan Chen, Lei Wei, and Yang Liu. Superion: Grammar-aware greybox fuzzing. In Proceedings of the 41st International Conference on Software Engineering (ICSE ’19), 2019.

[24] Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. Experimentation in Software Engineering. Springer Berlin Heidelberg, 2012.

[25] Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Fuzz4All: Universal fuzzing with large language models. arXiv preprint arXiv:2308.04748,

[26] Originally posted 2023; updated 2024

[27] Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Jiawei Liu, Reyhaneh Jabbarvand, and Lingming Zhang. WhiteFox: White-box compiler fuzzing empowered by large language models. In Proceedings of the ACM on Programming Languages (OOPSLA ’24), 2024.

[28] Chenyuan Yang, Zhouruixing Zhao, and Lingming Zhang. KernelGPT: Enhanced kernel fuzzing via large language models. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’25), 2025.

[29] J. Ye et al. Mut4All: Fuzzing compilers via LLM-synthesized mutators learned from bug reports. arXiv preprint arXiv:2507.19275, 2025.
