FunFuzz: An LLM-Powered Evolutionary Fuzzing Framework
Pith reviewed 2026-05-08 18:37 UTC · model grok-4.3
The pith
FunFuzz runs parallel isolated searches with candidate migration and prompt adaptation, achieving higher compiler coverage and more crash discoveries than prior LLM methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FunFuzz is a multi-island evolutionary fuzzing framework that initializes separate parallel searches with documentation-derived, topic-specific prompts, migrates promising inputs between islands at regular intervals to preserve diversity, and continuously refines prompts through feedback on incremental coverage while using compiler-internal signals to flag crash-inducing cases.
What carries the argument
Multi-island evolutionary mechanism that maintains search diversity through periodic migration of high-value candidates and feedback-guided selection for prompt adaptation.
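The mechanism is easiest to see as code. Below is a minimal Python sketch of one plausible reading of that loop; the Island shape, the ring-topology migration, and the generate/fitness/refine_prompt callables are illustrative assumptions, not FunFuzz's published API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Island:
    prompt: str  # topic-specific LLM instruction (derived from documentation)
    population: list[tuple[float, str]] = field(default_factory=list)  # (fitness, program)

def evolve(
    islands: list[Island],
    generate: Callable[[str, list], str],        # LLM sampling, supplied by the harness
    fitness: Callable[[str], float],             # incremental-coverage score, supplied
    refine_prompt: Callable[[str, float], str],  # feedback-guided adaptation, supplied
    rounds: int,
    migrate_every: int,
    top_k: int = 2,
) -> None:
    for r in range(rounds):
        for isl in islands:
            program = generate(isl.prompt, isl.population)
            score = fitness(program)
            isl.population.append((score, program))
            isl.population.sort(key=lambda p: p[0], reverse=True)
            isl.prompt = refine_prompt(isl.prompt, score)
        if (r + 1) % migrate_every == 0:
            # Ring migration: snapshot every island's current best candidates,
            # then hand each island its neighbor's donations, so no lineage
            # stays isolated long enough to collapse into a narrow input family.
            donations = [isl.population[:top_k] for isl in islands]
            for i, isl in enumerate(islands):
                isl.population.extend(donations[(i - 1) % len(islands)])
```

A ring is only one common migration topology; the paper's actual policy and selection pressure may differ.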
If this is right
- Higher compiler coverage than previous LLM-driven baselines in 24-hour campaigns.
- Larger set of unique failure-triggering inputs discovered on both GCC and Clang.
- Candidate prioritization by incremental coverage directs effort toward unexplored program regions.
- Compiler-internal failure signals reliably surface crash-inducing inputs without external oracles (see the sketch after this list).
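A minimal sketch of how the last two bullets could be realized together, assuming a coverage-instrumented compiler build: collect_coverage is a hypothetical placeholder for gcov/llvm-cov parsing, and the string match covers GCC's internal-compiler-error message (Clang emits different crash markers, so a real harness would check additional patterns).

```python
import subprocess

seen_regions: set[str] = set()  # compiler regions covered so far in the campaign

def collect_coverage() -> set[str]:
    """Placeholder: a real harness would parse gcov/llvm-cov output for the
    instrumented compiler after each compile to get the covered regions."""
    return set()

def evaluate_candidate(source_path: str, compiler: str = "gcc") -> tuple[int, bool]:
    proc = subprocess.run(
        [compiler, "-c", "-o", "/dev/null", source_path],
        capture_output=True, text=True, timeout=60,
    )
    # Compiler-internal failure signal: GCC announces internal errors on
    # stderr, so no external differential oracle is needed to flag a crash.
    crashed = "internal compiler error" in proc.stderr.lower()
    # Incremental coverage: only regions never seen in any earlier run count,
    # steering effort toward unexplored parts of the compiler.
    new_regions = collect_coverage() - seen_regions
    seen_regions.update(new_regions)
    return len(new_regions), crashed
```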
Where Pith is reading between the lines
- The island structure could be distributed across machines to increase the scale of parallel exploration.
- Coverage feedback for prompt adaptation might transfer to fuzzing other structured targets such as parsers or network protocols.
- Periodic migration may reduce the risk of individual searches converging on the same narrow set of inputs.
Load-bearing premise
That the added steps of migrating candidates between islands and adapting prompts on coverage feedback will produce net gains in diversity and coverage without introducing new biases or efficiency losses.
What would settle it
Repeated 24-hour runs on GCC and Clang in which FunFuzz achieves neither higher compiler coverage nor more unique failure-triggering inputs than the earlier LLM baselines would refute the claim.
Original abstract
Modern fuzzers increasingly use Large Language Models (LLMs) to generate structured inputs, but LLM-driven fuzzing is sensitive to prompt initialization and sampling variance, which can reduce exploration efficiency and lead to redundant inputs. We present FunFuzz, a multi-island evolutionary fuzzing framework that runs several isolated searches in parallel and periodically migrates high-value candidates to maintain diversity. FunFuzz derives initial generation prompts from documentation and initializes islands with topic-specific instructions, then continuously adapts prompts using feedback-guided selection. During fuzzing, candidates are prioritized by incremental compiler coverage, while compiler-internal failure signals are used to identify crash-inducing inputs. We evaluate FunFuzz on compiler fuzzing, where inputs are source programs and success is measured by compiler coverage and unique compiler-internal failures. Across repeated 24-hour campaigns on GCC and Clang, FunFuzz achieves higher compiler coverage than previous LLM-driven baselines and discovers more unique failure-triggering inputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FunFuzz, a multi-island evolutionary fuzzing framework powered by LLMs for generating compiler inputs. It runs several isolated parallel searches, periodically migrates high-value candidates to maintain diversity, derives initial prompts from documentation with topic-specific instructions, and adapts prompts via feedback-guided selection. Candidates are prioritized by incremental compiler coverage, and compiler-internal signals identify crashes. On GCC and Clang, repeated 24-hour campaigns reportedly yield higher coverage and more unique failure-triggering inputs than prior LLM-driven baselines.
Significance. If the empirical gains are shown to be robustly attributable to the evolutionary mechanisms rather than confounds, the work would offer a practical advance in LLM-based fuzzing by addressing prompt sensitivity and exploration redundancy. The combination of documentation-derived initialization, coverage-driven prioritization, and failure-signal detection is a sensible engineering contribution with potential applicability beyond compilers. However, the current evidence base limits the strength of this assessment.
major comments (2)
- Evaluation section (and abstract): The central claim attributes higher coverage and more unique failures to the multi-island design with periodic migration and feedback-guided prompt adaptation. No ablations are described that isolate these components (e.g., single-island vs. multi-island, static vs. adaptive prompts, or migration disabled). Without such controls or total LLM-query budgets matched to baselines, the observed improvements could arise from implementation details unrelated to the evolutionary framework, leaving the causal attribution insecure.
- Evaluation section: The abstract and summary state superior coverage and failure discovery across repeated 24-hour campaigns but provide no information on the number of independent runs, statistical significance testing, variance across runs, or exact baseline implementations (including prompt templates and compute budgets). These omissions make it impossible to assess whether the reported gains are reliable or reproducible.
minor comments (2)
- Abstract: The phrase 'higher compiler coverage than previous LLM-driven baselines' is vague; specify the exact baselines and the magnitude of the improvement (e.g., percentage points or absolute coverage numbers) for clarity.
- The paper would benefit from a brief discussion of potential biases introduced by migration overhead or prompt adaptation, even if negative results are reported.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation. We address each major comment below and describe the changes planned for the revised manuscript.
Point-by-point responses
- Referee: Evaluation section (and abstract): The central claim attributes higher coverage and more unique failures to the multi-island design with periodic migration and feedback-guided prompt adaptation. No ablations are described that isolate these components (e.g., single-island vs. multi-island, static vs. adaptive prompts, or migration disabled). Without such controls or total LLM-query budgets matched to baselines, the observed improvements could arise from implementation details unrelated to the evolutionary framework, leaving the causal attribution insecure.
Authors: We agree that the absence of component ablations weakens causal attribution to the multi-island and adaptive mechanisms. The current manuscript compares FunFuzz only against prior LLM-driven baselines under matched 24-hour wall-clock time; it does not include controlled variants that disable migration or adaptive prompting. In the revision we will add ablation experiments (single-island, static-prompt, and no-migration configurations) while enforcing identical total LLM query budgets across all settings. These new results will be reported with the same coverage and failure metrics to isolate the contribution of the evolutionary components. revision: yes
- Referee: Evaluation section: The abstract and summary state superior coverage and failure discovery across repeated 24-hour campaigns but provide no information on the number of independent runs, statistical significance testing, variance across runs, or exact baseline implementations (including prompt templates and compute budgets). These omissions make it impossible to assess whether the reported gains are reliable or reproducible.
Authors: We acknowledge that the evaluation reporting is incomplete on these points. Although the manuscript refers to repeated 24-hour campaigns, it does not state the number of runs, report variance, perform significance tests, or fully document baseline re-implementations. We will revise the evaluation section to specify the exact number of independent runs (five per configuration), present mean and standard-deviation values for coverage and unique failures, include statistical significance tests (Mann-Whitney U with Bonferroni correction), and supply the precise prompt templates and LLM query budgets used for each baseline. These additions will directly address reproducibility concerns. revision: yes
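For concreteness, the promised analysis reduces to something like the sketch below; the coverage values and the number of comparisons are invented placeholders, not the paper's data.

```python
from scipy.stats import mannwhitneyu

# Placeholder numbers standing in for five runs per configuration;
# they are NOT results from the paper.
funfuzz_cov = [61.2, 60.8, 62.1, 61.5, 60.9]   # % compiler coverage per run
baseline_cov = [57.4, 58.1, 56.9, 57.8, 57.2]

stat, p = mannwhitneyu(funfuzz_cov, baseline_cov, alternative="greater")
num_comparisons = 4            # e.g., {GCC, Clang} x {coverage, unique failures}
alpha = 0.05 / num_comparisons  # Bonferroni-corrected significance threshold
print(f"U = {stat}, p = {p:.4f}, significant at corrected alpha: {p < alpha}")
```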
Circularity Check
No circularity: purely empirical claims with no derivation chain
full rationale
The paper presents an LLM-powered evolutionary fuzzing framework evaluated through repeated 24-hour experimental campaigns measuring compiler coverage and unique failure-triggering inputs. No mathematical derivations, equations, fitted parameters, or self-referential predictions exist in the described approach or results. Claims rest on direct empirical comparisons to baselines rather than any internal logic that reduces by construction to the framework's own inputs or assumptions. The multi-island design, prompt adaptation, and prioritization are implementation choices whose effectiveness is tested externally via experiments, not derived tautologically.
Reference graph
Works this paper leans on
- [1] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 423–435.
- [2] Martin Eberlein, Yannic Noller, Thomas Vogel, and Lars Grunske. 2020. Evolutionary grammar-based fuzzing. In International Symposium on Search Based Software Engineering. Springer, 105–120.
- [3] Karine Even-Mendoza, Cristian Cadar, and Alastair F Donaldson. 2022. CsmithEdge: more effective compiler testing by handling undefined behaviour less conservatively. Empirical Software Engineering 27, 6 (2022), 129.
- [4] Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. 2020. AFL++: Combining incremental steps of fuzzing research. In 14th USENIX Workshop on Offensive Technologies (WOOT 20).
- [5] Adrian Herrera, Hendra Gunadi, Shane Magrath, Michael Norrish, Mathias Payer, and Antony L Hosking. 2021. Seed selection for successful fuzzing. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. 230–243.
- [6] Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with code fragments. In 21st USENIX Security Symposium (USENIX Security 12). 445–458.
- [7] Linghan Huang, Peizhou Zhao, Huaming Chen, and Lei Ma. 2024. Large language models based fuzzing techniques: A survey. arXiv e-prints (2024), arXiv–2402.
- [8] Linghan Huang, Peizhou Zhao, Lei Ma, and Huaming Chen. 2025. On the challenges of fuzzing techniques via large language models. In 2025 IEEE International Conference on Software Services Engineering (SSE). IEEE, 162–171.
- [9] Jaeseong Kwon, Bongjun Jang, Juneyoung Lee, and Kihong Heo. 2025. Optimization-Directed Compiler Fuzzing for Continuous Translation Validation. Proceedings of the ACM on Programming Languages 9, PLDI (2025), 627–650.
- [10] Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. 2025. ShinkaEvolve: Towards open-ended and sample-efficient program evolution. arXiv preprint arXiv:2509.19349 (2025).
- [11]
- [12] LLVM Project. [n. d.]. libFuzzer – a library for coverage-guided fuzz testing. https://llvm.org/docs/LibFuzzer.html. Accessed: 2026-02-03.
- [13]
- [14] Sanoop Mallissery and Yu-Sung Wu. 2023. Demystify the fuzzing methods: A comprehensive survey. Comput. Surveys 56, 3 (2023), 1–38.
- [15] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. 2025. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131 (2025).
- [16] Xianfei Ou, Cong Li, Yanyan Jiang, and Chang Xu. 2024. The mutators reloaded: Fuzzing compilers with large language model generated mutation operators. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4. 298–312.
- [17] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. 2024. Mathematical discoveries from program search with large language models. Nature 625, 7995 (2024), 468–475.
- [18] Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4All: Universal fuzzing with large language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13.
- [19] Yuanmin Xie, Zhenyang Xu, Yongqiang Tian, Min Zhou, Xintong Zhou, and Chengnian Sun. 2025. Kitten: A Simple Yet Effective Baseline for Evaluating LLM-Based Compiler Testing Techniques. In Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis. 21–25.
- [20] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. 283–294.
- [21] Michał Zalewski. 2014. American Fuzzy Lop (AFL). https://lcamtuf.coredump.cx/afl/. Accessed: 2026-02-03.
- [22]
- [23] Xiaogang Zhu, Wei Zhou, Qing-Long Han, Wanlun Ma, Sheng Wen, and Yang Xiang. 2025. When software security meets large language models: A survey. IEEE/CAA Journal of Automatica Sinica 12, 2 (2025), 317–334.