Large Language Model assisted Hybrid Fuzzing
Pith reviewed 2026-05-23 07:08 UTC · model grok-4.3
The pith
An LLM can replace traditional concolic execution in hybrid fuzzing by generating inputs from sliced traces to reach more branches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an LLM can serve as the solver inside hybrid fuzzing: after greybox fuzzing reaches a roadblock, a slice of the execution trace is fed to the model, which then produces a modified input that steers execution toward the desired branch. This LLM-based concolic step covers 31.43 percent more branches than CoFuzz, 44.56 percent more than Intriguer, and 59.48 percent more than QSYM. The same step finishes 3 to 19 times faster than the concolic engines in those tools, and the overall fuzzer found seven previously unknown bugs in the tested subjects.
What carries the argument
The LLM acts as the solver: given a sliced execution trace, it produces input modifications that reach the target branches.
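The loop described above can be sketched in a few lines. This is a toy illustration, not HyllFuzz's implementation: the target program, the `MAGIC` constant, the branch labels, and `stub_solver` (a hard-coded stand-in for the LLM call) are all hypothetical names introduced here.

```python
import random

MAGIC = b"FUZZ"  # hypothetical 4-byte check that single-byte mutation cannot satisfy


def target(data: bytes) -> list[str]:
    """Toy program under test; returns the branch labels it executed."""
    trace = ["entry"]
    if len(data) >= 4:
        trace.append("len_ok")
        if data[:4] == MAGIC:
            trace.append("magic_ok")  # the roadblock branch
    return trace


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Greybox-style mutation: overwrite one random byte."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def stub_solver(seed: bytes, sliced_trace: list[str], goal: str) -> bytes:
    """Stand-in for the LLM: a real system would send the sliced trace and
    the seed to a model and parse the suggested input from its reply."""
    if goal == "magic_ok" and "len_ok" in sliced_trace:
        return MAGIC + seed[4:]
    return seed


def hybrid_fuzz(seed: bytes, goal: str, budget: int,
                rng: random.Random) -> tuple[set[str], bool]:
    covered, corpus, used_solver = set(target(seed)), [seed], False
    # Phase 1: greybox fuzzing keeps inputs that add coverage.
    for _ in range(budget):
        child = mutate(rng.choice(corpus), rng)
        new = set(target(child)) - covered
        if new:
            covered |= new
            corpus.append(child)
    # Phase 2: on a roadblock, hand the trace to the solver and run its answer.
    if goal not in covered:
        candidate = stub_solver(corpus[-1], target(corpus[-1]), goal)
        covered |= set(target(candidate))
        used_solver = True
    return covered, used_solver


covered, used_solver = hybrid_fuzz(b"AAAAAAAA", "magic_ok", 200, random.Random(0))
print(sorted(covered), used_solver)
```

In this toy, mutation changes only one byte of the seed at a time and no intermediate coverage rewards partial progress, so the four-byte check is reachable only through the solver step.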
If this is right
- HyllFuzz covers 31.43 percent more branches than CoFuzz, 44.56 percent more than Intriguer, and 59.48 percent more than QSYM.
- The LLM-based concolic execution finishes 3 to 19 times faster than the concolic engines in the compared tools.
- The approach exposed seven previously unknown bugs in extensively tested real-world programs.
- LLM assistance can be inserted directly into the iterative greybox-plus-concolic loop without requiring full symbolic execution engines.
Where Pith is reading between the lines
- The same trace-slicing plus LLM prompting pattern could be tried in other dynamic analysis settings that currently rely on symbolic or concolic solvers.
- Reducing dependence on environment models might let hybrid fuzzers target a wider set of libraries and operating-system interfaces with less manual setup.
- Whether the coverage gains persist when the underlying LLM changes or when the programs are substantially larger than those tested remains open.
Load-bearing premise
An LLM can reliably produce input modifications that correctly reach the intended branches when given only a sliced execution trace.
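One way to probe this premise is to replay each proposed input concretely and keep only those that actually take the intended branch. The sketch below assumes a toy parser (`parse_header`) and hypothetical branch labels; it does not reflect how the paper checks its generated inputs.

```python
from typing import Callable, Iterable


def parse_header(data: bytes, hits: set[str]) -> None:
    """Toy target: records which branches a concrete run takes."""
    if len(data) < 6:
        hits.add("too_short")
        return
    if data[:2] != b"PK":
        hits.add("bad_magic")
        return
    declared = int.from_bytes(data[2:4], "little")
    if declared == len(data) - 4:
        hits.add("length_matches")  # the branch we want to reach
    else:
        hits.add("length_mismatch")


def reaches(run: Callable[[bytes, set[str]], None], data: bytes, branch: str) -> bool:
    """Execute the target on `data` and report whether `branch` was taken."""
    hits: set[str] = set()
    run(data, hits)
    return branch in hits


def validate(run, candidates: Iterable[bytes], branch: str) -> list[bytes]:
    """Keep only the candidates whose concrete run reaches `branch`."""
    return [c for c in candidates if reaches(run, c, branch)]


# Hypothetical solver outputs: one correct, two that miss the branch.
candidates = [
    b"PK\x02\x00ab",  # declared length 2, payload 2 bytes
    b"PK\x05\x00ab",  # declared length 5, payload 2 bytes
    b"XX\x02\x00ab",  # wrong magic
]
valid = validate(parse_header, candidates, "length_matches")
print(f"{len(valid)}/{len(candidates)} candidates reach the target branch")
```

The ratio of validated to proposed inputs gives a direct per-branch success rate, which is separate from end-to-end coverage.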
What would settle it
A head-to-head run on the same subjects in which HyllFuzz covers no more branches and runs no faster than CoFuzz, Intriguer, or QSYM would show the claimed gains do not hold.
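Such a comparison could be summarized with a relative coverage gain and an effect size over repeated runs. The following is a minimal sketch with made-up branch counts; the function names are introduced here, and how the paper aggregates its own numbers is not stated in this review.

```python
from statistics import mean


def relative_gain(ours: list[int], baseline: list[int]) -> float:
    """Percent more branches covered on average, relative to the baseline."""
    return 100.0 * (mean(ours) - mean(baseline)) / mean(baseline)


def a12(ours: list[int], baseline: list[int]) -> float:
    """Vargha-Delaney A12: probability that a run of `ours` covers more than
    a run of `baseline`, counting ties as half. 0.5 means no difference."""
    wins = sum((x > y) + 0.5 * (x == y) for x in ours for y in baseline)
    return wins / (len(ours) * len(baseline))


# Hypothetical branch counts from five runs per tool; not data from the paper.
tool_a = [1320, 1295, 1340, 1310, 1335]
tool_b = [1000, 1010, 990, 1005, 995]

print(f"relative gain: {relative_gain(tool_a, tool_b):.2f}%")
print(f"A12 effect size: {a12(tool_a, tool_b):.2f}")
```

A gain near zero together with an A12 near 0.5 would be the outcome described above, in which the claimed advantage does not hold.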
Original abstract
Greybox fuzzing, one of the most popular methods for detecting software vulnerabilities, conducts a biased random search within the program input space. To enhance its effectiveness in achieving deep coverage of program behaviors, greybox fuzzing is often combined with concolic execution, which performs a path-sensitive search over the domain of program inputs. In hybrid fuzzing, conventional greybox fuzzing is followed by concolic execution in an iterative loop, where reachability roadblocks encountered by greybox fuzzing are tackled by concolic execution. However, such hybrid fuzzing still suffers from difficulties conventionally faced by concolic execution, such as the need for environment modeling and system call support. In this work, we explore the potential of developing "smart" concolic execution empowered by Large Language Models (LLMs), leveraging their knowledge of code semantics during constraint computing and solving. When coverage-based greybox fuzzing reaches a roadblock in terms of reaching certain branches, we slice the execution trace and suggest modifications of the input to reach the relevant branches. The LLM is used as a solver to generate the modified input to reach the desired branches. Compared with state-of-the-art hybrid fuzzers CoFuzz, Intriguer, and QSYM, our LLM-based hybrid fuzzer HyllFuzz (pronounced "hill fuzz") covers 31.43%, 44.56%, and 59.48% more code branches, respectively. Furthermore, the LLM-based concolic execution in HyllFuzz runs 3–19 times faster than the concolic execution in existing hybrid fuzzing tools. In extensively tested real-world subjects, HyllFuzz exposed seven previously unknown bugs. This experience shows that LLMs can be effectively inserted into the iterative loop of hybrid fuzzers to efficiently expose more program behaviors.
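The abstract does not say how the execution trace is sliced. As an illustration of the general idea only, the sketch below computes a backward slice over data dependences on a recorded trace; the `Event` record, the helper names, and the example statements are assumptions made here, and control dependences are ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    text: str                # statement as executed
    defines: frozenset[str]  # variables written
    uses: frozenset[str]     # variables read


def backward_slice(trace: list[Event], branch_uses: set[str]) -> list[Event]:
    """Keep only the events that the branch condition depends on."""
    relevant = set(branch_uses)
    kept: list[Event] = []
    for event in reversed(trace):
        if event.defines & relevant:
            kept.append(event)
            relevant -= event.defines  # this definition is now explained
            relevant |= event.uses     # its inputs become relevant instead
    kept.reverse()
    return kept


def make_event(text: str, defines: set[str], uses: set[str]) -> Event:
    return Event(text, frozenset(defines), frozenset(uses))


# Hypothetical trace leading up to an unreached branch `if limit > 64`.
trace = [
    make_event("magic = data[0:4]", {"magic"}, {"data"}),
    make_event("log_level = 2", {"log_level"}, set()),
    make_event("size = data[4]", {"size"}, {"data"}),
    make_event("checksum = sum(data[5:])", {"checksum"}, {"data"}),
    make_event("limit = size * 2", {"limit"}, {"size"}),
]
sliced = backward_slice(trace, {"limit"})
for event in sliced:
    print(event.text)
```

Here five executed statements shrink to the two that determine `limit`, which is the kind of reduction that keeps a prompt short and focused on the bytes that matter.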
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HyllFuzz, a hybrid fuzzer that augments greybox fuzzing with an LLM-based concolic execution step: when greybox fuzzing hits a roadblock, execution traces are sliced and the LLM is invoked to generate modified inputs that reach target branches. It reports that HyllFuzz achieves 31.43%, 44.56%, and 59.48% more branch coverage than CoFuzz, Intriguer, and QSYM respectively, that its LLM-based concolic execution is 3–19× faster than conventional concolic execution, and that it discovered seven previously unknown bugs in real-world subjects.
Significance. If the central claim that the LLM reliably solves path constraints from sliced traces holds, the work would offer a practical route to mitigating long-standing concolic-execution obstacles (environment modeling, system-call support) and could materially improve hybrid-fuzzing effectiveness on complex programs. The explicit multi-tool comparison and the concrete bug findings would constitute useful empirical evidence for the community.
major comments (3)
- [Abstract and §3] Abstract and §3 (Approach): the headline coverage and speedup claims rest on the assertion that the LLM functions as a constraint solver, yet no prompt template, no example of a sliced trace constraint fed to the LLM, and no verification (e.g., via Z3 or concrete execution) that the generated inputs satisfy the original constraints are supplied. Without this evidence the observed gains cannot be confidently attributed to constraint solving rather than ancillary changes in mutation budget or scheduling.
- [§5] §5 (Evaluation): the quantitative results are presented without any description of experimental controls, number of independent runs, statistical significance tests, seed-selection protocol, or precise baseline configurations (e.g., timeout, mutation parameters, or environment settings for CoFuzz/Intriguer/QSYM). This absence directly undermines the soundness of the reported 31–59% coverage deltas.
- [§4–5] §4–5: the claim that LLM-based concolic execution is 3–19× faster is load-bearing for the overall contribution, yet the paper supplies neither the measurement methodology (wall-clock vs. CPU time, inclusion/exclusion of LLM API latency) nor any breakdown showing that the speedup is not simply an artifact of reduced path-exploration depth.
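The distinction between wall-clock and CPU time in the last comment matters because a remote model call mostly waits on the network. The sketch below uses `time.sleep` as a stand-in for API latency and a busy loop as a stand-in for CPU-bound solving; both workloads and the function names are hypothetical, not measurements of any real tool.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) for one call to fn."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0


def simulated_llm_call():
    time.sleep(0.2)  # stands in for network/API latency; uses almost no CPU


def simulated_solver_call():
    total = 0
    for i in range(300_000):  # stands in for CPU-bound constraint solving
        total += i * i
    return total


llm_wall, llm_cpu = measure(simulated_llm_call)
solver_wall, solver_cpu = measure(simulated_solver_call)
print(f"LLM stand-in:    wall={llm_wall:.3f}s cpu={llm_cpu:.3f}s")
print(f"solver stand-in: wall={solver_wall:.3f}s cpu={solver_cpu:.3f}s")
```

Reporting only CPU time would make the waiting step look nearly free, so a speedup claim needs to state which clock was used and whether latency is included.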
minor comments (2)
- [Abstract and §5] The abstract states that seven previously unknown bugs were found but provides no CVE identifiers, bug types, or reproduction details; a short table or paragraph in §5 would improve traceability.
- [§3] Notation for the slicing procedure and the interface between greybox fuzzer and LLM component is introduced without a diagram or pseudocode; a single figure would clarify the iterative loop.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested details and clarifications.
- Referee: [Abstract and §3] Abstract and §3 (Approach): the headline coverage and speedup claims rest on the assertion that the LLM functions as a constraint solver, yet no prompt template, no example of a sliced trace constraint fed to the LLM, and no verification (e.g., via Z3 or concrete execution) that the generated inputs satisfy the original constraints are supplied. Without this evidence the observed gains cannot be confidently attributed to constraint solving rather than ancillary changes in mutation budget or scheduling.
  Authors: We agree that the manuscript would benefit from greater transparency on the LLM invocation. In the revision we will add the exact prompt template and a worked example of a sliced execution trace together with the LLM-generated input. On verification, the current evaluation relies on end-to-end coverage and bug-finding results rather than per-constraint Z3 checks; we will add a short discussion of this design choice and, where feasible, include concrete-execution validation for a sample of generated inputs. The attribution to constraint solving follows directly from the architecture in which the LLM replaces the conventional solver after trace slicing.
  Revision: yes
- Referee: [§5] §5 (Evaluation): the quantitative results are presented without any description of experimental controls, number of independent runs, statistical significance tests, seed-selection protocol, or precise baseline configurations (e.g., timeout, mutation parameters, or environment settings for CoFuzz/Intriguer/QSYM). This absence directly undermines the soundness of the reported 31–59% coverage deltas.
  Authors: We accept that these methodological details are required for reproducibility. The revised evaluation section will explicitly state the number of independent runs performed, the statistical tests applied, the seed-selection protocol (identical initial seeds across all tools), and the precise versions, timeouts, and mutation parameters used for each baseline.
  Revision: yes
- Referee: [§4–5] §4–5: the claim that LLM-based concolic execution is 3–19× faster is load-bearing for the overall contribution, yet the paper supplies neither the measurement methodology (wall-clock vs. CPU time, inclusion/exclusion of LLM API latency) nor any breakdown showing that the speedup is not simply an artifact of reduced path-exploration depth.
  Authors: We will augment §§4–5 with a clear description of the timing methodology (wall-clock time inclusive of API latency) and will add a per-phase breakdown comparing LLM-based solving against the conventional concolic engines on matched sets of paths. This will demonstrate that the reported speedup is not an artifact of shallower exploration.
  Revision: yes
Circularity Check
No circularity: empirical results rest on external tool comparisons, not self-defined quantities
The paper reports experimental coverage gains (31.43–59.48%) and speedups (3–19×) from running HyllFuzz against CoFuzz, Intriguer, and QSYM on real-world subjects, plus seven new bugs. These are direct measurement outcomes from an implemented system, not quantities derived from equations, fitted parameters renamed as predictions, or self-citation chains. No mathematical derivation, ansatz, or uniqueness theorem is invoked; the central claim (LLM as constraint solver after trace slicing) is presented as an engineering hypothesis validated by external benchmarks rather than reduced to the authors' prior definitions or inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs possess sufficient knowledge of code semantics to solve constraints for input generation from execution traces.