pith. sign in

arxiv: 2604.17750 · v1 · submitted 2026-04-20 · 💻 cs.CR · cs.PL

SDLLMFuzz: Dynamic-static LLM-assisted greybox fuzzing for structured input programs

Pith reviewed 2026-05-10 05:18 UTC · model grok-4.3

classification 💻 cs.CR cs.PL
keywords greybox fuzzingLLM-assisted fuzzingstructured input programsvulnerability discoveryfeedback-driven input generationcrash analysisdynamic-static loop
0
0 comments X

The pith

SDLLMFuzz combines LLM-generated inputs with static crash analysis to discover bugs faster in structured-input programs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a fuzzing technique that uses large language models to create test inputs that respect the strict format requirements of certain programs. It adds a feedback mechanism where information from crashes is analyzed statically to inform the next set of LLM-generated inputs. This creates a cycle that helps the fuzzer reach deeper and more relevant parts of the code than mutation-only methods allow. Evaluation on the Magma benchmark with programs such as libxml2 shows improved rates of bug finding and shorter times to the first bug compared to prior approaches.

Core claim

The central claim is that a dynamic-static feedback loop, where LLMs produce syntactically valid and semantically diverse seeds while static analysis of core dumps and execution traces supplies semantic guidance, enables more efficient exploration of complex program behaviors in structured-input programs.

What carries the argument

The dynamic-static feedback loop that refines LLM inputs based on semantic information extracted from crash artifacts.

If this is right

  • Greater success in finding vulnerabilities within programs that process structured data like XML, PNG, and audio files.
  • Shorter intervals between starting the fuzzer and detecting the first bug.
  • Improved ability to generate inputs that satisfy syntactic constraints without relying solely on manual grammars or mutations.
  • More effective use of runtime crash information beyond simple coverage metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such hybrid systems may reduce the need for program-specific customizations in fuzzing tools.
  • Advances in LLM capabilities could further amplify the effectiveness of this feedback approach.
  • Similar techniques might apply to testing other constrained systems, such as network protocols or configuration parsers.

Load-bearing premise

Large language models can reliably produce syntactically valid and semantically diverse inputs, and static analysis of crash artifacts provides rich semantic information that guides effective subsequent input generation.

What would settle it

Repeating the Magma benchmark experiments and observing no significant gains in the number of bugs discovered or the time to first bug over traditional greybox fuzzers and other LLM baselines.

Figures

Figures reproduced from arXiv: 2604.17750 by Futai Zou, Tianming Zheng, Yihao Zou, Yue Wu.

Figure 1
Figure 1. Figure 1: Overview of the SDLLMFuzz framework. 3.5 Crash Feedback Encoding To enable the LLM to utilize crash information, we transform low-level debugging outputs into structured semantic descriptions. This encoding bridges the gap between execution data and LLM reasoning. 3.6 Dynamic–Static Feedback Loop The key component of SDLLMFuzz is the integration of dynamic execution and static analysis into a unified feedb… view at source ↗
Figure 2
Figure 2. Figure 2: Bug discovery results within 24 hours [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Bug discovery results within 48 hours. 5.2 Time-to-Bug Analysis [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Time-to-Bug heatmap 6 Discussion 6.1 Why SDLLMFuzz Works The effectiveness of SDLLMFuzz stems from the combination of semantic-aware input gener￾ation and feedback-driven refinement. Traditional fuzzing relies on random or heuristic mu￾tations, which are often ineffective for structured-input programs due to strict syntax and semantic constraints. By leveraging LLMs, SDLLMFuzz generates structure-aware inp… view at source ↗
Figure 5
Figure 5. Figure 5: Time-to-Bug curves [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Bug discovery in ablation study. testing environments. 6.3 Limitations Despite its effectiveness, SDLLMFuzz has several limitations. First, the performance of the framework depends on the quality of LLM-generated inputs. Poor prompt design or insufficient context may lead to suboptimal seed generation, affecting overall performance. Second, incorporating LLMs and static analysis introduces additional compu… view at source ↗
Figure 7
Figure 7. Figure 7: Time-to-Bug in ablation study. may still remain. External Validity. The evaluation is conducted on the Magma benchmark, which, al￾though widely used, may not fully represent all real-world software systems. The generalizability of the results to other domains requires further validation. Construct Validity. We primarily use bug coverage and time-to-bug as evaluation met￾rics. While these are standard in fu… view at source ↗
read the original abstract

Fuzzing has become a widely adopted technique for vulnerability discovery, yet it remains ineffective for structured-input programs due to strict syntactic constraints and limited semantic awareness. Traditional greybox fuzzers rely on mutation-based strategies and coarse-grained coverage feedback, which often fail to generate valid inputs and explore deep execution paths. Recent advances in large language models (LLMs) have shown promise in improving input generation, but existing approaches primarily focus on seed generation and largely overlook the effective use of runtime feedback. In this paper, we propose SDLLMFuzz, a dynamic-static LLM-assisted greybox fuzzing framework for structured-input programs. Our approach integrates LLM-based structure-aware seed generation with static crash analysis, forming a unified feedback loop that iteratively refines test inputs. Specifically, we leverage LLMs to generate syntactically valid and semantically diverse inputs, while extracting rich semantic information from crash artifacts (e.g., core dumps and execution traces) to guide subsequent input generation. This dynamic-static feedback mechanism enables more efficient exploration of complex program behaviors. We evaluate SDLLMFuzz on the Magma benchmark across multiple structured-input programs, including libxml2, libpng, and libsndfile. Experimental results show that SDLLMFuzz significantly outperforms traditional greybox fuzzers and LLM-assisted baselines in terms of bug discovery and time-to-bug. These results demonstrate that combining semantic input generation with feedback-driven refinement is an effective direction for improving fuzzing performance on structured-input programs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes SDLLMFuzz, a greybox fuzzing framework for structured-input programs that combines LLM-based generation of syntactically valid and semantically diverse seeds with static analysis of crash artifacts (core dumps and execution traces) to create a dynamic-static feedback loop. The approach is evaluated on the Magma benchmark for programs including libxml2, libpng, and libsndfile, with the abstract claiming significant outperformance over traditional greybox fuzzers and LLM-assisted baselines in bug discovery and time-to-bug.

Significance. If the empirical claims hold with proper quantification, the work could advance fuzzing for complex input formats by showing how LLM generation plus static crash-derived semantics can improve upon coverage-only feedback. The dynamic-static loop idea addresses a known limitation in greybox fuzzing, and the Magma evaluation target is appropriate for structured programs.

major comments (3)
  1. [Abstract] Abstract: The headline claim that 'Experimental results show that SDLLMFuzz significantly outperforms traditional greybox fuzzers and LLM-assisted baselines in terms of bug discovery and time-to-bug' is unsupported by any quantitative metrics, statistical details, error bars, baseline configurations, run counts, or methodology specifics anywhere in the manuscript.
  2. [Approach] Approach description: No validity-rate statistics, syntactic correctness measurements, or diversity metrics are reported for the LLM-generated inputs on formats such as XML, PNG, or sound files, leaving the core assumption that LLMs reliably produce usable seeds unverified and load-bearing for the claimed gains.
  3. [Evaluation] Evaluation: The manuscript contains no ablation isolating the static crash-analysis component from standard dynamic coverage feedback, nor any comparison of time-to-bug or unique bugs found with and without the static signals; without this, it is impossible to attribute outperformance to the proposed dynamic-static loop rather than engineering or run-time differences.
minor comments (1)
  1. [Abstract] The abstract and approach sections use terms such as 'rich semantic information' and 'unified feedback loop' without defining how the extracted crash data is encoded or fed back to the LLM.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to strengthen the presentation of our results. We address each major point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claim that 'Experimental results show that SDLLMFuzz significantly outperforms traditional greybox fuzzers and LLM-assisted baselines in terms of bug discovery and time-to-bug' is unsupported by any quantitative metrics, statistical details, error bars, baseline configurations, run counts, or methodology specifics anywhere in the manuscript.

    Authors: We agree that the abstract would be stronger with explicit quantitative support. In the revised version we will update the abstract to report concrete metrics from our Magma experiments, including the number of unique bugs found per target, mean time-to-bug with standard deviation, the number of independent runs (10 per fuzzer), and the statistical test used for significance. Baseline configurations and run-time settings will also be summarized briefly so the claim is self-contained. revision: yes

  2. Referee: [Approach] Approach description: No validity-rate statistics, syntactic correctness measurements, or diversity metrics are reported for the LLM-generated inputs on formats such as XML, PNG, or sound files, leaving the core assumption that LLMs reliably produce usable seeds unverified and load-bearing for the claimed gains.

    Authors: We acknowledge the omission. We will add a new table and accompanying text in the evaluation section that reports, for each target format, the syntactic validity rate of LLM-generated seeds (percentage that parse without error), the number of unique structural variants produced, and a simple diversity measure such as the count of distinct semantic categories observed across 1,000 samples. These measurements will be obtained from the same generation pipeline used in the main experiments. revision: yes

  3. Referee: [Evaluation] Evaluation: The manuscript contains no ablation isolating the static crash-analysis component from standard dynamic coverage feedback, nor any comparison of time-to-bug or unique bugs found with and without the static signals; without this, it is impossible to attribute outperformance to the proposed dynamic-static loop rather than engineering or run-time differences.

    Authors: We agree that an explicit ablation is necessary to attribute gains to the static component. We will add an ablation study that disables the static crash-analysis feedback while retaining all other components (LLM generation and dynamic coverage) and reports the resulting change in unique bugs discovered and time-to-bug on the same Magma targets and run configuration. The new results will appear in a dedicated subsection of the evaluation. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivation chain

full rationale

The paper proposes an LLM-assisted fuzzing framework and supports its claims solely via experimental results on the external Magma benchmark (libxml2, libpng, libsndfile). No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the provided text. The central performance claims are falsifiable against independent baselines and do not reduce to quantities defined by the paper's own choices.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about LLM capabilities and the utility of crash analysis; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Large language models can be used to generate syntactically valid and semantically diverse inputs for structured data formats
    This underpins the seed generation component of the framework.
  • domain assumption Static analysis of crash artifacts such as core dumps and execution traces provides rich semantic information that can guide effective subsequent input generation
    This is required for the dynamic-static feedback loop to function as described.

pith-pipeline@v0.9.0 · 5567 in / 1490 out tokens · 71700 ms · 2026-05-10T05:18:24.189736+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  1. [1]

    Nautilus: Fishing for deep bugs with grammars

    Cornelius Aschermann, Tommaso Frassetto, and Thorsten Holz. Nautilus: Fishing for deep bugs with grammars. In NDSS, 2019

  2. [2]

    Fuzzing: Challenges and reflec- tions

    Marcel Böhme, Cristian Cadar, and Abhik Roychoudhury. Fuzzing: Challenges and reflec- tions. IEEE Software , 38(3):79–86, 2020

  3. [3]

    Directed greybox fuzzing

    Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, et al. Directed greybox fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 2329–2344, 2017

  4. [4]

    Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI, volume 8, pages 209–224, 2008

  5. [5]

    Hawkeye: Towards a desired directed grey-box fuzzer

    Hongxu Chen, Yinxing Xue, Yang Li, et al. Hawkeye: Towards a desired directed grey-box fuzzer. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Commu- nications Security, pages 2095–2108, 2018

  6. [6]

    Compiler fuzzing through deep learning

    Chris Cummins, Pavlos Petoumenos, Alastair Murray, et al. Compiler fuzzing through deep learning. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis , pages 95–105, 2018

  7. [7]

    Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models

    Yanjun Deng, Chunqiu Steven Xia, Hao Peng, et al. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis , pages 423–435, 2023

  8. [8]

    Large language models are edge-case fuzzers: Testing deep learning libraries via fuzzgpt

    Yanjun Deng, Chunqiu Steven Xia, Cheng Yang, et al. Large language models are edge- case fuzzers: Testing deep learning libraries via fuzzgpt. arXiv preprint arXiv:2304.02014 , 2023

  9. [9]

    Machine learning for black-box fuzzing of network protocols

    Rui Fan and Yu Chang. Machine learning for black-box fuzzing of network protocols. In Information and Communications Security: 19th International Conference, ICICS 2017 , pages 621–632. Springer International Publishing, 2018

  10. [10]

    Patrice Godefroid, Adam Kiezun, and Michael Y. Levin. Grammar-based whitebox fuzzing. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation , pages 206–215, 2008. 14

  11. [11]

    Levin, and David Molnar

    Patrice Godefroid, Michael Y. Levin, and David Molnar. Automated whitebox fuzz testing. In NDSS, volume 8, pages 151–166, 2008

  12. [12]

    Learn&fuzz: Machine learning for input fuzzing

    Patrice Godefroid, Hila Peleg, and Rishabh Singh. Learn&fuzz: Machine learning for input fuzzing. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE) , pages 50–59. IEEE, 2017

  13. [13]

    Ganfuzz: A gan-based industrial network protocol fuzzing framework

    Zhifeng Hu, Jing Shi, Yu-Heng Huang, et al. Ganfuzz: A gan-based industrial network protocol fuzzing framework. In Proceedings of the 15th ACM International Conference on Computing Frontiers, pages 138–145, 2018

  14. [14]

    Fuzzing: A survey

    Jian Li, Bo Zhao, and Chao Zhang. Fuzzing: A survey. Cybersecurity, 1(1):1–13, 2018

  15. [15]

    Deepfuzz: Automatic generation of syn- tax valid c programs for fuzz testing

    Xuefeng Liu, Xiaoting Li, Rohan Prajapati, et al. Deepfuzz: Automatic generation of syn- tax valid c programs for fuzz testing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1044–1051, 2019

  16. [16]

    Miller, Lars Fredriksen, and Bryan So

    Barton P. Miller, Lars Fredriksen, and Bryan So. An empirical study of the reliability of unix utilities. Communications of the ACM , 33(12):32–44, 1990

  17. [17]

    Jensen, and Chris W

    Matthias Sablotny, Bjørn S. Jensen, and Chris W. Johnson. Recurrent neural networks for fuzz testing web browsers. In Information Security and Cryptology–ICISC 2018 , pages 354–370. Springer International Publishing, 2019

  18. [18]

    Fuzzing: Brute Force Vulnerability Discovery

    Michael Sutton, Adam Greene, and Pedram Amini. Fuzzing: Brute Force Vulnerability Discovery. Pearson Education, 2007

  19. [19]

    Superion: Grammar-aware greybox fuzzing

    Junjie Wang, Bihuan Chen, Lei Wei, et al. Superion: Grammar-aware greybox fuzzing. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE) , pages 724–735. IEEE, 2019

  20. [20]

    Universal fuzzing via large language models

    Chunqiu Steven Xia, Michele Paltenghi, Jie Tian, et al. Fuzz4all: Universal fuzzing with large language models. arXiv preprint arXiv:2308.04748 , 2024

  21. [21]

    Format-aware learn&fuzz: Deep test data generation for efficient fuzzing

    Mohammad Zakeri Nasrabadi, Saeed Parsa, and Alireza Kalaee. Format-aware learn&fuzz: Deep test data generation for efficient fuzzing. Neural Computing and Applications , 33:1497–1513, 2021

  22. [22]

    LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing

    Hongyu Zhang, Yicheng Rong, Yuxuan He, et al. Llamafuzz: Large language model en- hanced greybox fuzzing. arXiv preprint arXiv:2406.07714 , 2024

  23. [23]

    Seqfuzzer: An industrial protocol fuzzing framework from a deep learning perspective

    Hong Zhao, Zhen Li, Hao Wei, et al. Seqfuzzer: An industrial protocol fuzzing framework from a deep learning perspective. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST) , pages 59–67. IEEE, 2019

  24. [24]

    Fuzzing: A survey for roadmap

    Xiaogang Zhu, Sheng Wen, Seyit Camtepe, et al. Fuzzing: A survey for roadmap. ACM Computing Surveys , 54(11s):1–36, 2022. 15