Large Language Model assisted Hybrid Fuzzing

Abhik Roychoudhury; Gregory J. Duck; Ruijie Meng

arxiv: 2412.15931 · v2 · submitted 2024-12-20 · 💻 cs.SE · cs.CR

Pith reviewed 2026-05-23 07:08 UTC · model grok-4.3

keywords hybrid fuzzing · large language models · concolic execution · greybox fuzzing · code coverage · input generation · software testing · vulnerability detection

The pith

An LLM can replace traditional concolic execution in hybrid fuzzing by generating inputs from sliced traces to reach more branches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can handle the concolic execution step inside the standard hybrid fuzzing loop. When greybox fuzzing stalls at a branch, the system slices the execution trace and prompts the LLM to propose input changes that would reach the target. This substitution sidesteps the environment modeling and system-call problems that normally limit concolic execution. Experiments on real-world programs show the resulting tool, HyllFuzz, reaches substantially more branches than three existing hybrid fuzzers while completing the constraint-solving step much faster and exposing new bugs.

Core claim

The central claim is that an LLM can serve as the solver inside hybrid fuzzing: after greybox fuzzing reaches a roadblock, a slice of the execution trace is fed to the model, which then produces a modified input that steers execution toward the desired branch. This LLM-based concolic step covers 31.43 percent more branches than CoFuzz, 44.56 percent more than Intriguer, and 59.48 percent more than QSYM. The same step finishes 3 to 19 times faster than the concolic engines in those tools, and the overall fuzzer found seven previously unknown bugs in the tested subjects.
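The loop described above can be illustrated with a small toy. This is a minimal sketch under stated assumptions, not the paper's implementation: `target_program`, `mutate`, `stub_solver`, and `hybrid_fuzz` are hypothetical names, the set of branches is hard-coded rather than read from a coverage report, and the solver is a lookup table standing in for the LLM call.

```python
import random
from typing import Callable, Iterable, Optional, Set

# Hypothetical branch ids of the toy program below.
ALL_BRANCHES = {"entry", "magic_fail", "magic_ok", "high_byte"}

def target_program(data: bytes) -> Set[str]:
    """Toy program under test; returns the branch ids it covered."""
    covered = {"entry"}
    if data[:4] == b"FUZZ":  # magic-value check: a typical roadblock
        covered.add("magic_ok")
        if len(data) > 4 and data[4] > 0x7F:
            covered.add("high_byte")
    else:
        covered.add("magic_fail")
    return covered

def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Greybox-style mutation: overwrite one randomly chosen byte."""
    buf = bytearray(seed or b"\x00")
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)

Solver = Callable[[bytes, str], Optional[bytes]]

def stub_solver(seed: bytes, roadblock: str) -> Optional[bytes]:
    """Stand-in for the LLM: proposes an input for a known roadblock."""
    answers = {"magic_ok": b"FUZZ\x00", "high_byte": b"FUZZ\xff"}
    return answers.get(roadblock)

def hybrid_fuzz(seeds: Iterable[bytes], solver: Solver,
                rounds: int = 2, budget: int = 200,
                rng_seed: int = 0) -> Set[str]:
    rng = random.Random(rng_seed)
    corpus = list(seeds)
    covered: Set[str] = set()
    for seed in corpus:
        covered |= target_program(seed)
    for _ in range(rounds):
        # Greybox phase: keep only mutants that add coverage.
        for _ in range(budget):
            candidate = mutate(rng.choice(corpus), rng)
            new = target_program(candidate) - covered
            if new:
                corpus.append(candidate)
                covered |= new
        # Stalled: hand each uncovered branch to the solver.
        for roadblock in sorted(ALL_BRANCHES - covered):
            candidate = solver(rng.choice(corpus), roadblock)
            if candidate is None:
                continue
            # Accept the proposal only if concrete execution adds coverage.
            new = target_program(candidate) - covered
            if new:
                corpus.append(candidate)
                covered |= new
    return covered
```

In this toy, single-byte mutation of the seed `b"AAAAA"` cannot produce the four-byte magic value, because mutants without new coverage are discarded. `hybrid_fuzz([b"AAAAA"], stub_solver)` therefore reaches all four branches, while a solver that always returns `None` leaves coverage at `entry` and `magic_fail`.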

What carries the argument

An LLM acts as the solver: given a sliced execution trace, it produces input modifications intended to reach the target branch.
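To make "sliced execution trace" concrete, here is a hedged sketch of a backward slice over a recorded trace. It follows data dependences only and ignores control dependences, so it is simpler than what a real slicer would do. The `Step` record and `backward_slice` function are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class Step:
    """One executed statement, with the variables it defines and uses."""
    line: int
    text: str
    defs: FrozenSet[str]
    uses: FrozenSet[str]

def backward_slice(trace: List[Step], criterion: FrozenSet[str]) -> List[Step]:
    """Keep only the steps that the criterion variables depend on (data flow only)."""
    relevant = set(criterion)
    kept: List[Step] = []
    for step in reversed(trace):
        if step.defs & relevant:
            kept.append(step)
            relevant -= step.defs   # these definitions are now explained
            relevant |= step.uses   # their operands become relevant in turn
    kept.reverse()
    return kept

# Hypothetical trace leading to a branch that tests `first`.
TRACE = [
    Step(1, "buf = read_input()", frozenset({"buf"}), frozenset()),
    Step(2, "n = len(buf)", frozenset({"n"}), frozenset({"buf"})),
    Step(3, "log = format(n)", frozenset({"log"}), frozenset({"n"})),
    Step(4, "first = buf[0]", frozenset({"first"}), frozenset({"buf"})),
    Step(5, "depth = 0", frozenset({"depth"}), frozenset()),
]

SLICE = backward_slice(TRACE, frozenset({"first"}))
```

For the branch condition on `first`, the slice keeps lines 1 and 4 and drops the length, logging, and depth statements. Only the retained statements would need to appear in a prompt.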

If this is right

HyllFuzz covers 31.43 percent more branches than CoFuzz, 44.56 percent more than Intriguer, and 59.48 percent more than QSYM.
The LLM-based concolic execution finishes 3 to 19 times faster than the concolic engines in the compared tools.
The approach exposed seven previously unknown bugs in extensively tested real-world programs.
LLM assistance can be inserted directly into the iterative greybox-plus-concolic loop without requiring full symbolic execution engines.
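The "percent more branches" and "times faster" figures above are relative quantities. The sketch below shows the arithmetic under one assumption: each figure compares a single aggregate number per tool. This page does not say whether the paper averages per-subject gains or aggregates first. The function names are illustrative.

```python
def relative_gain_percent(ours: float, baseline: float) -> float:
    """Percent by which `ours` exceeds `baseline`, e.g. in covered branches."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (ours - baseline) / baseline * 100.0

def speedup(baseline_seconds: float, ours_seconds: float) -> float:
    """How many times faster `ours` is than the baseline."""
    if ours_seconds <= 0:
        raise ValueError("ours_seconds must be positive")
    return baseline_seconds / ours_seconds
```

Read this way, a 31.43 percent gain over a baseline that covers 1000 branches corresponds to about 1314 branches. A 19 times speedup means that a step taking 19 seconds in the baseline takes 1 second in the compared tool.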

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trace-slicing plus LLM prompting pattern could be tried in other dynamic analysis settings that currently rely on symbolic or concolic solvers.
Reducing dependence on environment models might let hybrid fuzzers target a wider set of libraries and operating-system interfaces with less manual setup.
Whether the coverage gains persist when the underlying LLM changes or when the programs are substantially larger than those tested remains open.

Load-bearing premise

An LLM can reliably produce input modifications that correctly reach the intended branches when given only a sliced execution trace.
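One way to probe this premise is to execute each proposed input and check whether the target branch was actually hit. The following is a minimal sketch assuming an instrumented runner that returns the set of covered branch ids. `reaches`, `success_rate`, and `toy_run` are hypothetical names, not part of the paper's tooling.

```python
from typing import Callable, Iterable, Set

Runner = Callable[[bytes], Set[str]]

def reaches(run: Runner, candidate: bytes, target: str) -> bool:
    """True if concrete execution of `candidate` covers `target`."""
    try:
        return target in run(candidate)
    except Exception:
        # A crash before coverage is reported counts as "not verified" here;
        # a real fuzzer would also triage the crash as a potential bug.
        return False

def success_rate(run: Runner, candidates: Iterable[bytes], target: str) -> float:
    """Fraction of candidates that reach the target branch."""
    items = list(candidates)
    if not items:
        return 0.0
    return sum(reaches(run, c, target) for c in items) / len(items)

def toy_run(data: bytes) -> Set[str]:
    """Hypothetical instrumented program with a single interesting branch."""
    covered = {"entry"}
    if data[:1] == b"{":
        covered.add("object")
    return covered
```

With `toy_run`, the candidates `b"{}"`, `b"[]"`, and `b"{"` give a success rate of two thirds for the `object` branch. A per-roadblock rate of this kind would indicate how often the model's proposals satisfy the path condition, separately from end-to-end coverage.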

What would settle it

A head-to-head run on the same subjects in which HyllFuzz covers no more branches and runs no faster than CoFuzz, Intriguer, or QSYM would show the claimed gains do not hold.
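A head-to-head comparison across repeated runs is usually summarized with an effect size. As a hedged illustration, the sketch below computes the Vargha-Delaney A12 statistic in plain Python. It is a generic measure and is not claimed to be the paper's analysis script; the function name `a12` is an assumption.

```python
from typing import Sequence

def a12(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Vargha-Delaney A12: P(X > Y) + 0.5 * P(X == Y) over all pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    equal = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * equal) / (len(xs) * len(ys))
```

Here `xs` and `ys` would hold final branch counts per run for two fuzzers. A value of 1.0, as in `a12([5, 6, 7], [1, 2, 3])`, means the first tool won every pairing. A value near 0.5 means neither tool is systematically ahead, which is the outcome that would undercut the claimed gains.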

Figures

Figures reproduced from arXiv: 2412.15931 by Abhik Roychoudhury, Gregory J. Duck, Ruijie Meng.

**Figure 1.** Code fragment adapted from the cJSON.c file within the cJSON subject.
Adjacent paper text: … efficiency. As a result, the performance bottlenecks inherent in concolic execution remain a primary limiting factor for hybrid fuzzing. 2.2 Limitations on Concolic Execution. Concolic execution spends considerable time on heavy-weight symbolic emulation and constraint solving. A concolic execution engine relies on constraint solvers as i…

**Figure 2.** Workflow of solving the roadblock at cJSON.c:24 based on a seed input (selected from the seed corpus of the greybox fuzzer) and the constraints related to the roadblock (sliced from source code), and the new input is generated by the LLM.
Adjacent paper text: … forms the Roadblock item. Next, HyLLfuzz runs the program using the selected input and collects the execution trace. Afterward, HyLLfuzz performs a backward code slicing …

**Figure 3.** Overall workflow of HyLLfuzz including greybox fuzzing and LLM-based concolic execution.
Adjacent paper text: … the greybox fuzzer struggles to cover new branches within a given time, HyLLfuzz steps in by identifying roadblocks from its maintained coverage report, which include uncovered branches and the inputs that reach but fail to satisfy the path conditions. Afterward, HyLLfuzz simulates traditional concolic execution by exe…

**Figure 4.** Prompt template for generating new input based on the given input and relevant code slice.

**Figure 5.** Average code coverage over time by AFL, Intriguer, QSYM and HyLLfuzz across 10 runs of 24 hours.

**Figure 6.** Average percentage of effective inputs generated by …

**Figure 7.** Time used by Intriguer, QSYM, and HyLLfuzz for each run of concolic execution (seconds).
Adjacent paper text: On average, 13.21% of inputs generated by HyLLfuzz effectively explore new code paths, compared to 8.83% for Intriguer and 21.74% for QSYM. 4.4 Time Consumption (RQ.3). In this experiment, we measured the time consumed by each concolic execution run in Intriguer, QSYM, and HyLLfuzz. To do this, we randomly sampled 5000 …

**Figure 8.** Case study the bug from the Jenkins subject released in the 2024 DARPA and ARPA-H’s Artificial …
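Figure 4 refers to a prompt template that combines the given input with the relevant code slice. The template itself is not reproduced on this page, so the sketch below is an invented stand-in showing the general shape of such a prompt and a defensive parser for the reply. The wording, the hex encoding, and the names `build_prompt` and `parse_reply` are all assumptions.

```python
from typing import Optional

def build_prompt(seed: bytes, code_slice: str, target: str) -> str:
    """Assemble a hypothetical prompt; not the template from the paper."""
    return (
        "You are given a program input and the code slice that decides a branch.\n"
        f"Target branch: {target}\n"
        f"Current input (hex): {seed.hex()}\n"
        "Relevant code slice:\n"
        f"{code_slice}\n"
        "Reply with only a modified input, hex-encoded, that reaches the target branch."
    )

def parse_reply(reply: str) -> Optional[bytes]:
    """Decode a hex reply; return None when the model did not follow the format."""
    cleaned = "".join(reply.split())  # drop spaces and newlines
    if not cleaned:
        return None
    try:
        return bytes.fromhex(cleaned)
    except ValueError:
        return None
```

Hex encoding is used here so that binary inputs survive a text-only interface. A reply such as `46 55 5a 5a` decodes to `b"FUZZ"`, while free-form text is rejected instead of being passed to the fuzzer.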

Original abstract

Greybox fuzzing is one of the most popular methods for detecting software vulnerabilities, which conducts a biased random search within the program input space. To enhance its effectiveness in achieving deep coverage of program behaviors, greybox fuzzing is often combined with concolic execution, which performs a path-sensitive search over the domain of program inputs. In hybrid fuzzing, conventional greybox fuzzing is followed by concolic execution in an iterative loop, where reachability roadblocks encountered by greybox fuzzing are tackled by concolic execution. However, such hybrid fuzzing still suffers from difficulties conventionally faced by concolic execution, such as the need for environment modeling and system call support. In this work, we explore the potential of developing "smart" concolic execution empowered by Large Language Models (LLMs), leveraging their knowledge of code semantics during constraint computing and solving. When coverage-based greybox fuzzing reaches a roadblock in terms of reaching certain branches, we conduct a slicing on the execution trace and suggest modifications of the input to reach the relevant branches. The LLM is used as a solver to generate the modified input to reach the desired branches. Compared with state-of-the-art hybrid fuzzers CoFuzz, Intriguer, and QSYM, our LLM-based hybrid fuzzer HyllFuzz (pronounced "hill fuzz") covers 31.43%, 44.56%, and 59.48% more code branches, respectively. Furthermore, the LLM-based concolic execution in HyllFuzz takes a time that is 3–19 times faster than the concolic execution running in existing hybrid fuzzing tools. In extensively tested real-world subjects, HyllFuzz exposed seven previously unknown bugs. This experience shows that LLMs can be effectively inserted into the iterative loop of hybrid fuzzers to efficiently expose more program behaviors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HyllFuzz swaps an LLM in for the concolic solver after trace slicing and reports 30-60% more branches plus 3-19x speedups over three baselines, but the abstract gives no evidence the LLM actually solves constraints.

The main thing here is the substitution of an LLM for the concolic engine inside the hybrid loop. After greybox fuzzing hits a roadblock, they slice the trace and feed it to the LLM to produce a modified input that reaches the target branch. That specific replacement is not in the prior tools they cite, so the application is new on its face. The paper also runs the comparison against named systems (CoFuzz, Intriguer, QSYM), gives concrete branch-coverage deltas, notes the time savings on the concolic step, and reports seven new bugs in real subjects. Those elements are useful for anyone tracking hybrid-fuzzer performance numbers. The architecture description is straightforward and the motivation around environment-modeling problems is clear.

The soft spot is exactly where the stress-test note flags it. The coverage and speedup claims rest on the LLM functioning as a constraint solver, yet the abstract supplies no prompt template, no example of a sliced constraint versus the input it produced, and no check that the output satisfies the original path condition. Without that, the gains could come from extra mutation budget, different scheduling, or other unmentioned changes rather than genuine solving. The experimental section also omits controls, baseline configuration details, and any statistical checks.

This is aimed at the fuzzing and software-security tooling group. Someone already building or evaluating hybrid fuzzers would want to see the full methods to decide whether the LLM step is reproducible. I would send it for peer review. The idea is simple enough and the reported results are specific enough that a referee could usefully check the missing verification steps and experimental controls.

Referee Report

3 major / 2 minor

Summary. The paper proposes HyllFuzz, a hybrid fuzzer that augments greybox fuzzing with an LLM-based concolic execution step: when greybox fuzzing hits a roadblock, execution traces are sliced and the LLM is invoked to generate modified inputs that reach target branches. It reports that HyllFuzz achieves 31.43%, 44.56%, and 59.48% more branch coverage than CoFuzz, Intriguer, and QSYM respectively, that its LLM-based concolic execution is 3–19× faster than conventional concolic execution, and that it discovered seven previously unknown bugs in real-world subjects.

Significance. If the central claim that the LLM reliably solves path constraints from sliced traces holds, the work would offer a practical route to mitigating long-standing concolic-execution obstacles (environment modeling, system-call support) and could materially improve hybrid-fuzzing effectiveness on complex programs. The explicit multi-tool comparison and the concrete bug findings would constitute useful empirical evidence for the community.

major comments (3)

[Abstract and §3 (Approach)] The headline coverage and speedup claims rest on the assertion that the LLM functions as a constraint solver, yet no prompt template, no example of a sliced trace constraint fed to the LLM, and no verification (e.g., via Z3 or concrete execution) that the generated inputs satisfy the original constraints are supplied. Without this evidence the observed gains cannot be confidently attributed to constraint solving rather than ancillary changes in mutation budget or scheduling.
[§5 (Evaluation)] The quantitative results are presented without any description of experimental controls, number of independent runs, statistical significance tests, seed-selection protocol, or precise baseline configurations (e.g., timeout, mutation parameters, or environment settings for CoFuzz/Intriguer/QSYM). This absence directly undermines the soundness of the reported 31–59% coverage deltas.
[§4–5] The claim that LLM-based concolic execution is 3–19× faster is load-bearing for the overall contribution, yet the paper supplies neither the measurement methodology (wall-clock vs. CPU time, inclusion/exclusion of LLM API latency) nor any breakdown showing that the speedup is not simply an artifact of reduced path-exploration depth.

minor comments (2)

[Abstract and §5] The abstract states that seven previously unknown bugs were found but provides no CVE identifiers, bug types, or reproduction details; a short table or paragraph in §5 would improve traceability.
[§3] Notation for the slicing procedure and the interface between greybox fuzzer and LLM component is introduced without a diagram or pseudocode; a single figure would clarify the iterative loop.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested details and clarifications.

Referee: [Abstract and §3 (Approach)] The headline coverage and speedup claims rest on the assertion that the LLM functions as a constraint solver, yet no prompt template, no example of a sliced trace constraint fed to the LLM, and no verification (e.g., via Z3 or concrete execution) that the generated inputs satisfy the original constraints are supplied. Without this evidence the observed gains cannot be confidently attributed to constraint solving rather than ancillary changes in mutation budget or scheduling.

Authors: We agree that the manuscript would benefit from greater transparency on the LLM invocation. In the revision we will add the exact prompt template and a worked example of a sliced execution trace together with the LLM-generated input. On verification, the current evaluation relies on end-to-end coverage and bug-finding results rather than per-constraint Z3 checks; we will add a short discussion of this design choice and, where feasible, include concrete-execution validation for a sample of generated inputs. The attribution to constraint solving follows directly from the architecture in which the LLM replaces the conventional solver after trace slicing.
revision: yes

Referee: [§5 (Evaluation)] The quantitative results are presented without any description of experimental controls, number of independent runs, statistical significance tests, seed-selection protocol, or precise baseline configurations (e.g., timeout, mutation parameters, or environment settings for CoFuzz/Intriguer/QSYM). This absence directly undermines the soundness of the reported 31–59% coverage deltas.

Authors: We accept that these methodological details are required for reproducibility. The revised evaluation section will explicitly state the number of independent runs performed, the statistical tests applied, the seed-selection protocol (identical initial seeds across all tools), and the precise versions, timeouts, and mutation parameters used for each baseline.
revision: yes

Referee: [§4–5] The claim that LLM-based concolic execution is 3–19× faster is load-bearing for the overall contribution, yet the paper supplies neither the measurement methodology (wall-clock vs. CPU time, inclusion/exclusion of LLM API latency) nor any breakdown showing that the speedup is not simply an artifact of reduced path-exploration depth.

Authors: We will augment §§4–5 with a clear description of the timing methodology (wall-clock time inclusive of API latency) and will add a per-phase breakdown comparing LLM-based solving against the conventional concolic engines on matched sets of paths. This will demonstrate that the reported speedup is not an artifact of shallower exploration.
revision: yes
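The wall-clock versus CPU-time distinction raised in this exchange matters because a remote model call spends most of its time waiting on the network. The sketch below shows one way to record both with the standard library. It is a hedged illustration only: `timed` and `fake_llm_call` are hypothetical names, and the sleep stands in for API latency.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

def timed(fn: Callable[[], T]) -> Tuple[T, float, float]:
    """Run `fn` and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # monotonic wall clock, includes waiting
    cpu_start = time.process_time()    # CPU time of this process, excludes sleep
    result = fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu

def fake_llm_call() -> bytes:
    """Stand-in for a remote model call dominated by latency."""
    time.sleep(0.05)
    return b"FUZZ"
```

For a latency-bound step like `fake_llm_call`, wall time is far larger than CPU time. A CPU-bound symbolic engine shows the opposite pattern, so a fair speedup figure needs to state which clock was used for both sides.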

Circularity Check

0 steps flagged

No circularity: empirical results rest on external tool comparisons, not self-defined quantities

The paper reports experimental coverage gains (31.43–59.48%) and speedups (3–19×) from running HyllFuzz against CoFuzz, Intriguer, and QSYM on real-world subjects, plus seven new bugs. These are direct measurement outcomes from an implemented system, not quantities derived from equations, fitted parameters renamed as predictions, or self-citation chains. No mathematical derivation, ansatz, or uniqueness theorem is invoked; the central claim (LLM as constraint solver after trace slicing) is presented as an engineering hypothesis validated by external benchmarks rather than reduced to the authors' prior definitions or inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLMs possess usable code-semantic knowledge sufficient to generate valid inputs without explicit environment models.

axioms (1)

domain assumption: LLMs possess sufficient knowledge of code semantics to solve constraints for input generation from execution traces.
Invoked when the paper positions the LLM as a solver that avoids conventional concolic execution difficulties (abstract).

pith-pipeline@v0.9.0 · 5876 in / 1227 out tokens · 45063 ms · 2026-05-23T07:08:41.959858+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

[1]

Redqueen: Fuzzing with input-to-state correspondence

Cornelius Aschermann, Sergej Schumilo, Tim Blazytko, Robert Gawlik, and Thorsten Holz. Redqueen: Fuzzing with input-to-state correspondence. In Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS) , volume 19, pages 1–15, 2019

work page 2019
[2]

Fuzzing: Challenges and reflections

Marcel Böhme, Cristian Cadar, and Abhik Roychoudhury. Fuzzing: Challenges and reflections. IEEE Software, 38(3):79– 86, 2020

work page 2020
[3]

On the reliability of coverage-based fuzzer benchmarking

Marcel Böhme, László Szekeres, and Jonathan Metzman. On the reliability of coverage-based fuzzer benchmarking. In Proceedings of the 44th International Conference on Software Engineering (ICSE) , pages 1621–1633, 2022

work page 2022
[4]

Böhme, C

M. Böhme, C. Cadar, and A. Roychoudhury. Fuzzing: Challenges and reflections. IEEE Software, 38(3), 2021

work page 2021
[5]

Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs

Cristian Cadar, Daniel Dunbar, Dawson R Engler, et al. Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 209–224, 2008

work page 2008
[6]

Exe: Automatically generating inputs of death

Cristian Cadar, Vijay Ganesh, Peter M Pawlowski, David L Dill, and Dawson R Engler. Exe: Automatically generating inputs of death. ACM Transactions on Information and System Security (TISSEC) , 12(2):1–38, 2008

work page 2008
[7]

Symbolic execution for software testing: three decades later

Cristian Cadar and Koushik Sen. Symbolic execution for software testing: three decades later. Communications of the ACM, 56(2):82–90, 2013

work page 2013
[8]

{MEUZZ}: Smart seed scheduling for hybrid fuzzing

Yaohui Chen, Mansour Ahmadi, Boyu Wang, Long Lu, et al. {MEUZZ}: Smart seed scheduling for hybrid fuzzing. In Proceedings of the 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID) , pages 77–92, 2020

work page 2020
[9]

Chatunitest: A framework for llm-based test generation

Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. Chatunitest: A framework for llm-based test generation. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE), pages 572–576, 2024

work page 2024
[10]

S2e: A platform for in-vivo multi-path analysis of software systems

Vitaly Chipounov, Volodymyr Kuznetsov, and George Candea. S2e: A platform for in-vivo multi-path analysis of software systems. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) , pages 265–278, 2011

work page 2011
[11]

Intriguer: Field-level constraint solving for hybrid fuzzing

Mingi Cho, Seoyoung Kim, and Taekyoung Kwon. Intriguer: Field-level constraint solving for hybrid fuzzing. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS) , pages 515–530, 2019

work page 2019
[12]

Coverage-guided, in-process fuzzing for the jvm

Code-Intelligence. Coverage-guided, in-process fuzzing for the jvm. https://github.com/CodeIntelligenceTesting/jazzer

work page
[13]

Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models

Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT international symposium on software testing and analysis (ISSTA) , pages 423–435, 2023

work page 2023
[14]

AFL++ : Combining incremental steps of fuzzing research

Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. AFL++ : Combining incremental steps of fuzzing research. In Proceedings of the 14th USENIX Workshop on Offensive Technologies , 2020

work page 2020
[15]

Collafl: Path sensitive fuzzing

Shuitao Gan, Chao Zhang, Xiaojun Qin, Xuwen Tu, Kang Li, Zhongyu Pei, and Zuoning Chen. Collafl: Path sensitive fuzzing. In Proceedings of the 39th IEEE Symposium on Security and Privacy (IEEE S&P) , pages 679–696, 2018

work page 2018
[16]

Dart: Directed automated random testing

Patrice Godefroid, Nils Klarlund, and Koushik Sen. Dart: Directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation (PLDI) , pages 213–223, 2005

work page 2005
[17]

Large language models for software engineering: A systematic literature review

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 2023

work page 2023
[18]

Pangolin: Incremental hybrid fuzzing with polyhedral path abstraction

Heqing Huang, Peisen Yao, Rongxin Wu, Qingkai Shi, and Charles Zhang. Pangolin: Incremental hybrid fuzzing with polyhedral path abstraction. In Proceedings of the 41st IEEE Symposium on Security and Privacy (IEEE S&P) , pages 1613–1627, 2020

work page 2020
[19]

Towards understanding the effectiveness of large language models on directed test input generation

Zongze Jiang, Ming Wen, Jialun Cao, Xuanhua Shi, and Hai Jin. Towards understanding the effectiveness of large language models on directed test input generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1408–1420, 2024

work page 2024
[20]

A segmented memory model for symbolic execution

Timotej Kapus and Cristian Cadar. A segmented memory model for symbolic execution. InProceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 774–784, 2019

work page 2019
[21]

Evaluating fuzz testing

George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. Evaluating fuzz testing. In Proceedings of the 2018 ACM SIGSAC conference on computer and communications security (CCS) , pages 2123–2138, 2018

work page 2018
[22]

{UNIFUZZ }: A holistic and pragmatic {Metrics-Driven} platform for evaluating fuzzers

Yuwei Li, Shouling Ji, Yuan Chen, Sizhuang Liang, Wei-Han Lee, Yueyao Chen, Chenyang Lyu, Chunming Wu, Raheem Beyah, Peng Cheng, et al. {UNIFUZZ }: A holistic and pragmatic {Metrics-Driven} platform for evaluating fuzzers. In Proceedings of the 30th USENIX Security Symposium (USENIX Security) , pages 2777–2794, 2021

work page 2021
[23]

Hybrid concolic testing

Rupak Majumdar and Koushik Sen. Hybrid concolic testing. In Proceedings of the 29th International Conference on Software Engineering (ICSE), pages 416–426, 2007. , Vol. 1, No. 1, Article . Publication date: December 2024. 20 Ruijie Meng, Gregory J. Duck, and Abhik Roychoudhury

work page 2007
[24]

Large language model guided protocol fuzzing

Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. Large language model guided protocol fuzzing. In Proceedings of the 31st Annual Network and Distributed System Security Symposium (NDSS) , 2024

work page 2024
[25]

Fuzzbench: an open fuzzer benchmarking platform and service

Jonathan Metzman, László Szekeres, Laurent Simon, Read Sprabery, and Abhishek Arya. Fuzzbench: an open fuzzer benchmarking platform and service. In Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering (ESEC/FSE) , pages 1393–1403, 2021

work page 2021
[26]

Beyond accuracy: Evaluating the reasoning behavior of large language models–a survey

Philipp Mondorf and Barbara Plank. Beyond accuracy: Evaluating the reasoning behavior of large language models–a survey. arXiv preprint arXiv:2404.01869, 2024

work page arXiv 2024
[27]

Transformed vargha-delaney effect size

Geoffrey Neumann, Mark Harman, and Simon Poulding. Transformed vargha-delaney effect size. In Proceedings of Search-Based Software Engineering: 7th International Symposium , pages 318–324, 2015

work page 2015
[28]

Coverup: Coverage-guided llm-based test generation

Juan Altmayer Pizzorno and Emery D Berger. Coverup: Coverage-guided llm-based test generation. arXiv preprint arXiv:2403.16218, 2024

work page arXiv 2024
[29]

Code-aware prompting: A study of coverage-guided test generation in regression setting using llm

Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, and Baishakhi Ray. Code-aware prompting: A study of coverage-guided test generation in regression setting using llm. Proceedings of the ACM on Software Engineering , 1(FSE):951–971, 2024

work page 2024
[31]

An empirical evaluation of using large language models for automated unit test generation

Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering , 2023

work page 2023
[32]

Cute: A concolic unit testing engine for c

Koushik Sen, Darko Marinov, and Gul Agha. Cute: A concolic unit testing engine for c. ACM SIGSOFT Software Engineering Notes, 30(5):263–272, 2005

work page 2005
[33]

Llm4fuzz: Guided fuzzing of smart contracts with large language models

Chaofan Shou, Jing Liu, Doudou Lu, and Koushik Sen. Llm4fuzz: Guided fuzzing of smart contracts with large language models. arXiv preprint arXiv:2401.11108, 2024

work page arXiv 2024
[34]

Driller: Augmenting fuzzing through selective symbolic execution

Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. Driller: Augmenting fuzzing through selective symbolic execution. In Proceedings of the 2016 Network and Distributed System Security Symposium (NDSS) , volume 16, pages 1–16, 2016

work page 2016
[35]

Fuzz4all: Universal fuzzing with large language models

Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Fuzz4all: Universal fuzzing with large language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE), pages 1–13, 2024

work page 2024
[36]

On the evaluation of large language models in unit test generation

Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, et al. On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE) , pages 1607–1619, 2024

work page 2024
[37]

{QSYM}: A practical concolic execution engine tailored for hybrid fuzzing

Insu Yun, Sangho Lee, Meng Xu, Yeongjin Jang, and Taesoo Kim. {QSYM}: A practical concolic execution engine tailored for hybrid fuzzing. In Proceedings of the 27th USENIX Security Symposium (USENIX Security) , pages 745–761, 2018

work page 2018
[38]

Michał Zalewski. AFL. https://lcamtuf.coredump.cx/afl/

work page
[39]

Send hardest problems my way: Probabilistic path prioritization for hybrid fuzzing

Lei Zhao, Yue Duan, and Jifeng XUAN. Send hardest problems my way: Probabilistic path prioritization for hybrid fuzzing. In Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS) , 2019. , Vol. 1, No. 1, Article . Publication date: December 2024

work page 2019

[1] [1]

Redqueen: Fuzzing with input-to-state correspondence

Cornelius Aschermann, Sergej Schumilo, Tim Blazytko, Robert Gawlik, and Thorsten Holz. Redqueen: Fuzzing with input-to-state correspondence. In Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS) , volume 19, pages 1–15, 2019

work page 2019

[2] [2]

Fuzzing: Challenges and reflections

Marcel Böhme, Cristian Cadar, and Abhik Roychoudhury. Fuzzing: Challenges and reflections. IEEE Software, 38(3):79– 86, 2020

work page 2020

[3] [3]

On the reliability of coverage-based fuzzer benchmarking

Marcel Böhme, László Szekeres, and Jonathan Metzman. On the reliability of coverage-based fuzzer benchmarking. In Proceedings of the 44th International Conference on Software Engineering (ICSE) , pages 1621–1633, 2022

work page 2022

[4] [4]

Böhme, C

M. Böhme, C. Cadar, and A. Roychoudhury. Fuzzing: Challenges and reflections. IEEE Software, 38(3), 2021

work page 2021

[5] [5]

Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs

Cristian Cadar, Daniel Dunbar, Dawson R Engler, et al. Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 209–224, 2008

work page 2008

[6] [6]

Exe: Automatically generating inputs of death

Cristian Cadar, Vijay Ganesh, Peter M Pawlowski, David L Dill, and Dawson R Engler. Exe: Automatically generating inputs of death. ACM Transactions on Information and System Security (TISSEC) , 12(2):1–38, 2008

work page 2008

[7] [7]

Symbolic execution for software testing: three decades later

Cristian Cadar and Koushik Sen. Symbolic execution for software testing: three decades later. Communications of the ACM, 56(2):82–90, 2013

work page 2013

[8] [8]

{MEUZZ}: Smart seed scheduling for hybrid fuzzing

Yaohui Chen, Mansour Ahmadi, Boyu Wang, Long Lu, et al. {MEUZZ}: Smart seed scheduling for hybrid fuzzing. In Proceedings of the 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID) , pages 77–92, 2020

work page 2020

[9] [9]

Chatunitest: A framework for llm-based test generation

Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. Chatunitest: A framework for llm-based test generation. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE), pages 572–576, 2024

work page 2024

[10] [10]

S2e: A platform for in-vivo multi-path analysis of software systems

Vitaly Chipounov, Volodymyr Kuznetsov, and George Candea. S2e: A platform for in-vivo multi-path analysis of software systems. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) , pages 265–278, 2011

work page 2011

[11] [11]

Intriguer: Field-level constraint solving for hybrid fuzzing

Mingi Cho, Seoyoung Kim, and Taekyoung Kwon. Intriguer: Field-level constraint solving for hybrid fuzzing. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 515–530, 2019

work page 2019

[12] [12]

Coverage-guided, in-process fuzzing for the jvm

Code-Intelligence. Coverage-guided, in-process fuzzing for the jvm. https://github.com/CodeIntelligenceTesting/jazzer

work page

[13] [13]

Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models

Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT international symposium on software testing and analysis (ISSTA), pages 423–435, 2023

work page 2023

[14] [14]

AFL++: Combining incremental steps of fuzzing research

Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. AFL++: Combining incremental steps of fuzzing research. In Proceedings of the 14th USENIX Workshop on Offensive Technologies, 2020

work page 2020

[15] [15]

Collafl: Path sensitive fuzzing

Shuitao Gan, Chao Zhang, Xiaojun Qin, Xuwen Tu, Kang Li, Zhongyu Pei, and Zuoning Chen. Collafl: Path sensitive fuzzing. In Proceedings of the 39th IEEE Symposium on Security and Privacy (IEEE S&P), pages 679–696, 2018

work page 2018

[16] [16]

Dart: Directed automated random testing

Patrice Godefroid, Nils Klarlund, and Koushik Sen. Dart: Directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation (PLDI), pages 213–223, 2005

work page 2005

[17] [17]

Large language models for software engineering: A systematic literature review

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 2023

work page 2023

[18] [18]

Pangolin: Incremental hybrid fuzzing with polyhedral path abstraction

Heqing Huang, Peisen Yao, Rongxin Wu, Qingkai Shi, and Charles Zhang. Pangolin: Incremental hybrid fuzzing with polyhedral path abstraction. In Proceedings of the 41st IEEE Symposium on Security and Privacy (IEEE S&P), pages 1613–1627, 2020

work page 2020

[19] [19]

Towards understanding the effectiveness of large language models on directed test input generation

Zongze Jiang, Ming Wen, Jialun Cao, Xuanhua Shi, and Hai Jin. Towards understanding the effectiveness of large language models on directed test input generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1408–1420, 2024

work page 2024

[20] [20]

A segmented memory model for symbolic execution

Timotej Kapus and Cristian Cadar. A segmented memory model for symbolic execution. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 774–784, 2019

work page 2019

[21] [21]

Evaluating fuzz testing

George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. Evaluating fuzz testing. In Proceedings of the 2018 ACM SIGSAC conference on computer and communications security (CCS), pages 2123–2138, 2018

work page 2018

[22] [22]

UNIFUZZ: A holistic and pragmatic Metrics-Driven platform for evaluating fuzzers

Yuwei Li, Shouling Ji, Yuan Chen, Sizhuang Liang, Wei-Han Lee, Yueyao Chen, Chenyang Lyu, Chunming Wu, Raheem Beyah, Peng Cheng, et al. UNIFUZZ: A holistic and pragmatic Metrics-Driven platform for evaluating fuzzers. In Proceedings of the 30th USENIX Security Symposium (USENIX Security), pages 2777–2794, 2021

work page 2021

[23] [23]

Hybrid concolic testing

Rupak Majumdar and Koushik Sen. Hybrid concolic testing. In Proceedings of the 29th International Conference on Software Engineering (ICSE), pages 416–426, 2007

work page 2007

[24] [24]

Large language model guided protocol fuzzing

Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. Large language model guided protocol fuzzing. In Proceedings of the 31st Annual Network and Distributed System Security Symposium (NDSS), 2024

work page 2024

[25] [25]

Fuzzbench: an open fuzzer benchmarking platform and service

Jonathan Metzman, László Szekeres, Laurent Simon, Read Sprabery, and Abhishek Arya. Fuzzbench: an open fuzzer benchmarking platform and service. In Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering (ESEC/FSE), pages 1393–1403, 2021

work page 2021

[26] [26]

Beyond accuracy: Evaluating the reasoning behavior of large language models–a survey

Philipp Mondorf and Barbara Plank. Beyond accuracy: Evaluating the reasoning behavior of large language models–a survey. arXiv preprint arXiv:2404.01869, 2024

work page arXiv 2024

[27] [27]

Transformed vargha-delaney effect size

Geoffrey Neumann, Mark Harman, and Simon Poulding. Transformed vargha-delaney effect size. In Proceedings of Search-Based Software Engineering: 7th International Symposium, pages 318–324, 2015

work page 2015

[28] [28]

Coverup: Coverage-guided llm-based test generation

Juan Altmayer Pizzorno and Emery D Berger. Coverup: Coverage-guided llm-based test generation. arXiv preprint arXiv:2403.16218, 2024

work page arXiv 2024

[29] [29]

Code-aware prompting: A study of coverage-guided test generation in regression setting using llm

Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, and Baishakhi Ray. Code-aware prompting: A study of coverage-guided test generation in regression setting using llm. Proceedings of the ACM on Software Engineering, 1(FSE):951–971, 2024

work page 2024

[30] [31]

An empirical evaluation of using large language models for automated unit test generation

Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 2023

work page 2023

[31] [32]

Cute: A concolic unit testing engine for c

Koushik Sen, Darko Marinov, and Gul Agha. Cute: A concolic unit testing engine for c. ACM SIGSOFT Software Engineering Notes, 30(5):263–272, 2005

work page 2005

[32] [33]

Llm4fuzz: Guided fuzzing of smart contracts with large language models

Chaofan Shou, Jing Liu, Doudou Lu, and Koushik Sen. Llm4fuzz: Guided fuzzing of smart contracts with large language models. arXiv preprint arXiv:2401.11108, 2024

work page arXiv 2024

[33] [34]

Driller: Augmenting fuzzing through selective symbolic execution

Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. Driller: Augmenting fuzzing through selective symbolic execution. In Proceedings of the 2016 Network and Distributed System Security Symposium (NDSS), volume 16, pages 1–16, 2016

work page 2016

[34] [35]

Fuzz4all: Universal fuzzing with large language models

Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Fuzz4all: Universal fuzzing with large language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE), pages 1–13, 2024

work page 2024

[35] [36]

On the evaluation of large language models in unit test generation

Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, et al. On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1607–1619, 2024

work page 2024

[36] [37]

QSYM: A practical concolic execution engine tailored for hybrid fuzzing

Insu Yun, Sangho Lee, Meng Xu, Yeongjin Jang, and Taesoo Kim. QSYM: A practical concolic execution engine tailored for hybrid fuzzing. In Proceedings of the 27th USENIX Security Symposium (USENIX Security), pages 745–761, 2018

work page 2018

[37] [38]

Michał Zalewski. AFL. https://lcamtuf.coredump.cx/afl/

work page

[38] [39]

Send hardest problems my way: Probabilistic path prioritization for hybrid fuzzing

Lei Zhao, Yue Duan, and Jifeng Xuan. Send hardest problems my way: Probabilistic path prioritization for hybrid fuzzing. In Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS), 2019

work page 2019