pith. sign in

arxiv: 2510.19984 · v1 · submitted 2025-10-22 · 💻 cs.SE

On Interaction Effects in Greybox Fuzzing

Pith reviewed 2026-05-18 04:17 UTC · model grok-4.3

classification 💻 cs.SE
keywords greybox fuzzingmutator interactionsinteraction effectssoftware testingcode coveragebug detectionfuzz testingmutation sequences
0
0 comments X

The pith

Mutator order matters in greybox fuzzing because pairwise interactions affect coverage gains, and learning conditional probabilities to pick sequences improves results over fixed or independent choices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the sequence of mutators applied to a seed input changes how much new code a greybox fuzzer covers. The authors collect data on every possible pair of mutators, fit a linear model, and observe clear interaction effects where the success of one mutator depends on the one applied before it. They then build MuoFuzz, which estimates the conditional probability that the next mutator will produce an interesting input and draws full sequences from this model using a random walk. If the approach holds, fuzzers that currently pick mutators with fixed probabilities or optimize each one alone can be outperformed by explicitly accounting for these ordered dependencies.

Core claim

Greybox fuzzers become more effective when they learn and exploit interaction effects among mutators by estimating the conditional probability that a given mutator will yield coverage gains after the previous one, then sampling promising sequences via random walk rather than using fixed selection probabilities or independent optimization.

What carries the argument

A conditional probability model over ordered mutator pairs, sampled by random walk to produce full mutation sequences for seed inputs.

If this is right

  • MuoFuzz records the highest code coverage on both the FuzzBench and MAGMA benchmark suites.
  • MuoFuzz discovers four bugs that AFL++ misses and one bug missed by both AFL++ and MOPT.
  • Choosing mutator sequences according to learned conditional probabilities is more efficient than fixed-probability selection or per-mutator optimization alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pairwise-interaction modeling could be applied to other mutation-driven search processes such as evolutionary test generation or program repair.
  • If the learned probabilities shift as the corpus grows, periodic re-estimation during a campaign might further increase gains.
  • The linear-model evidence for interactions suggests that similar dependencies may exist between other fuzzing decisions, such as seed scheduling or energy assignment.

Load-bearing premise

Interaction effects measured from short, isolated mutator-pair trials on a small set of seeds will still guide better choices when the model runs inside full-length fuzzing campaigns on many different programs.

What would settle it

Running MuoFuzz, AFL++, and MOPT head-to-head on the FuzzBench and MAGMA suites and finding that MuoFuzz does not produce the highest coverage or the unique bugs reported would show the learned sequences add no practical gain.

Figures

Figures reproduced from arXiv: 2510.19984 by Alberto Bacchelli, Konstantinos Kitsios, Marcel B\"ohme.

Figure 1
Figure 1. Figure 1: The left figure shows the number of interesting [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: High-level overview of MuoFuzz. The division with the sum turns each row into a probability dis￾tribution that sums to 1. The distribution for the freetype program is shown on the right side of [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
read the original abstract

A greybox fuzzer is an automated software testing tool that generates new test inputs by applying randomly chosen mutators (e.g., flipping a bit or deleting a block of bytes) to a seed input in random order and adds all coverage-increasing inputs to the corpus of seeds. We hypothesize that the order in which mutators are applied to a seed input has an impact on the effectiveness of greybox fuzzers. In our experiments, we fit a linear model to a dataset that contains the effectiveness of all possible mutator pairs and indeed observe the conjectured interaction effect. This points us to more efficient fuzzing by choosing the most promising mutator sequence with a higher likelihood. We propose MuoFuzz, a greybox fuzzer that learns and chooses the most promising mutator sequences. MuoFuzz learns the conditional probability that the next mutator will yield an interesting input, given the previously selected mutator. Then, it samples from the learned probability using a random walk to generate mutator sequences. We compare the performance of MuoFuzz to AFL++, which uses a fixed selection probability, and MOPT, which optimizes the selection probability of each mutator in isolation. Experimental results on the FuzzBench and MAGMA benchmarks show that MuoFuzz achieves the highest code coverage and finds four bugs missed by AFL++ and one missed by both AFL++ and MOPT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper hypothesizes that the sequence in which mutators are applied affects greybox fuzzer effectiveness due to interaction effects. It fits a linear model to pairwise mutator effectiveness data collected on limited seeds and reports observing such interactions. It then proposes MuoFuzz, which learns conditional probabilities P(next mutator is interesting | previous mutator) from coverage data and samples mutator sequences via random walk. On FuzzBench and MAGMA, MuoFuzz reports higher coverage than AFL++ (fixed probabilities) and MOPT (independent optimization) and discovers bugs missed by the baselines.

Significance. If the results hold, the work supplies empirical support for the value of modeling mutator interactions rather than treating selection probabilities as independent. The evaluation on standard, reproducible benchmarks (FuzzBench, MAGMA) is a strength and enables direct comparison with prior fuzzers. The approach of learning conditionals from observed coverage outcomes is falsifiable and grounded in data rather than hand-tuned parameters.

major comments (2)
  1. [Evaluation] The pairwise interaction measurements are performed on restricted seed sets in isolation, yet the central performance claims rest on MuoFuzz using the resulting conditional probabilities inside full, long-running campaigns on evolving corpora. No ablation or transfer experiment is reported that checks whether the learned conditionals remain predictive once the corpus diversifies and runtimes extend, leaving open whether observed gains are due to interaction modeling or other adaptive mechanics.
  2. [MuoFuzz Design] The manuscript does not state whether the conditional probability table is computed once from an initial dataset or is updated online as new coverage-increasing inputs are discovered during a campaign. This detail is load-bearing for the claim that interaction effects observed in the linear model translate to sustained gains in the deployed fuzzer.
minor comments (2)
  1. [Abstract] The abstract states that MuoFuzz finds four bugs missed by AFL++ and one missed by both AFL++ and MOPT, but the results section should list the specific bugs, the programs on which they were found, and the exact run counts used for each comparison.
  2. [Experiments] Statistical controls (number of independent runs, variance reporting, or significance tests) for the coverage curves are not described in sufficient detail to allow readers to judge the robustness of the reported improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our evaluation and design that we address point by point below. We provide the strongest honest clarifications supported by the manuscript while indicating revisions where appropriate.

read point-by-point responses
  1. Referee: [Evaluation] The pairwise interaction measurements are performed on restricted seed sets in isolation, yet the central performance claims rest on MuoFuzz using the resulting conditional probabilities inside full, long-running campaigns on evolving corpora. No ablation or transfer experiment is reported that checks whether the learned conditionals remain predictive once the corpus diversifies and runtimes extend, leaving open whether observed gains are due to interaction modeling or other adaptive mechanics.

    Authors: The pairwise measurements on restricted seed sets were intentionally performed in isolation to establish the existence of interaction effects via the linear model, as stated in the manuscript. The performance claims for MuoFuzz are evaluated exclusively through full, long-running campaigns on the FuzzBench and MAGMA benchmarks, where the learned conditional probabilities are applied directly to evolving corpora over extended runtimes. These experiments demonstrate higher coverage and additional unique bugs compared to AFL++ (fixed probabilities) and MOPT (independent optimization). While we did not report a dedicated ablation isolating the transferability of the conditionals, the end-to-end results on standard benchmarks already reflect usage in realistic, diversifying settings and support that interaction modeling contributes to the observed gains beyond other adaptive mechanics. We will add a clarifying paragraph in the Evaluation section discussing the applicability of the pre-learned conditionals during long campaigns. revision: partial

  2. Referee: [MuoFuzz Design] The manuscript does not state whether the conditional probability table is computed once from an initial dataset or is updated online as new coverage-increasing inputs are discovered during a campaign. This detail is load-bearing for the claim that interaction effects observed in the linear model translate to sustained gains in the deployed fuzzer.

    Authors: We thank the referee for identifying this missing detail. In MuoFuzz, the conditional probability table is computed once from the initial coverage dataset collected on limited seeds (the same data underlying the linear model) and is not updated online during the campaign. This static design directly translates the observed pairwise interactions into sequence sampling via random walk without additional runtime adaptation. We will revise the MuoFuzz Design section to explicitly describe the one-time computation process, its grounding in the initial data, and how it enables sustained use of the learned interactions. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation begins with a hypothesis on mutator ordering, supports it via an empirical linear model fit to a dataset of mutator-pair effectiveness (used only to observe interaction effects), and then implements MuoFuzz to learn conditional probabilities for sequencing and evaluates the resulting fuzzer directly on FuzzBench and MAGMA benchmarks. No equation or claim reduces a reported performance number or 'prediction' to the initial model fit by construction; the benchmark results are generated from independent full fuzzing campaigns rather than being statistically forced by the motivating linear model. The approach remains self-contained against external benchmarks with no load-bearing self-citation or ansatz smuggling.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach depends on empirical data fitting for both the linear model and the conditional probabilities; no new physical or mathematical axioms are introduced beyond standard assumptions in coverage-guided fuzzing.

free parameters (1)
  • conditional probabilities for mutator transitions
    These are learned from coverage outcomes during the training phase and directly determine the random-walk sampling behavior.
axioms (1)
  • domain assumption Coverage increase is a reliable proxy for interesting inputs that advance the fuzzing campaign.
    Invoked when defining the target of the learned conditional probability.

pith-pipeline@v0.9.0 · 5780 in / 1288 out tokens · 29794 ms · 2026-05-18T04:17:44.276534+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 2 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report

  2. [2]

    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem.Machine learning47 (2002), 235–256

  3. [3]

    Yoav Benjamini and Yosef Hochberg. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.Journal of the Royal Statistical Society: Series B (Methodological)57, 1 (1995), 289–300. https: //doi.org/10.1111/j.2517-6161.1995.tb02031.x

  4. [4]

    Marcel Böhme and Brandon Falk. 2020. Fuzzing: On the exponential cost of vulnerability discovery. InProceedings of the 28th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 713–724

  5. [5]

    Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury

  6. [6]

    InProceedings of the 2017 ACM SIGSAC conference on computer and communications security

    Directed greybox fuzzing. InProceedings of the 2017 ACM SIGSAC conference on computer and communications security. 2329–2344

  7. [7]

    Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2016. Coverage- based greybox fuzzing as markov chain. InProceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 1032–1043. ICSE 2026, April 12–18, 2026, Rio de Janeiro, Brazil Konstantinos Kitsios, Marcel Böhme, and Alberto Bacchelli

  8. [8]

    Brown, Vincent J

    Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-Based n-gram Models of Natural Language.Compu- tational Linguistics18, 4 (1992), 467–479. https://www.aclweb.org/anthology/J92- 4003/

  9. [9]

    2006.Direct methods for sparse linear systems

    Timothy A Davis. 2006.Direct methods for sparse linear systems. SIAM

  10. [10]

    Draper and Harry Smith

    Norman R. Draper and Harry Smith. 1998. Applied Regression Analysis.Wiley Series in Probability and Statistics3 (1998)

  11. [11]

    Jeffrey L. Elman. 1990. Finding Structure in Time.Cognitive Science14, 2 (1990), 179–211. https://doi.org/10.1207/s15516709cog1402_1

  12. [12]

    Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. 2020. {AFL++}: Combining incremental steps of fuzzing research. In14th USENIX Workshop on Offensive Technologies (WOOT 20)

  13. [13]

    R. A. Fisher. 1925. Statistical Methods for Research Workers.Edinburgh Oliver and Boyd12 (1925), 356–369

  14. [14]

    Vasudev Gohil, Rahul Kande, Chen Chen, Ahmad-Reza Sadeghi, and Jeyavijayan Rajendran. 2024. Mabfuzz: Multi-armed bandit algorithms for fuzzing processors. In2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1–6

  15. [15]

    2016.Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016.Deep Learning. MIT Press. http://www.deeplearningbook.org

  16. [16]

    Google. n.d.. libFuzzer – A Library for Coverage-Guided Fuzz Testing. https: //llvm.org/docs/LibFuzzer.html. Accessed: 2024-08-25

  17. [17]

    Ahmad Hazimeh, Adrian Herrera, and Mathias Payer. 2020. Magma: A ground- truth fuzzing benchmark.Proceedings of the ACM on Measurement and Analysis of Computing Systems4, 3 (2020), 1–29

  18. [18]

    Laura Inozemtseva and Reid Holmes. 2014. Coverage is not strongly correlated with test suite effectiveness. InProceedings of the 36th international conference on software engineering. 435–445

  19. [19]

    Patrick Jauernig, Domagoj Jakobovic, Stjepan Picek, Emmanuel Stapf, and Ahmad- Reza Sadeghi. 2022. DARWIN: Survival of the fittest fuzzing mutators.arXiv preprint arXiv:2210.11783(2022)

  20. [20]

    On Interaction Effects in Greybox Fuzzing

    Konstantinos Kitsios, Marcel Böhme, and Alberto Bacchelli. 2025.Replication Package for "On Interaction Effects in Greybox Fuzzing". https://doi.org/10.5281/ zenodo.17391101

  21. [21]

    George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating fuzz testing. InProceedings of the 2018 ACM SIGSAC conference on computer and communications security. 2123–2138

  22. [22]

    Yuki Koike, Hiroyuki Katsura, Hiromu Yakura, and Yuma Kurogome. 2022. Slopt: Bandit optimization framework for mutation-based fuzzing. InProceedings of the 38th Annual Computer Security Applications Conference. 519–533

  23. [23]

    Mario Köppen. 2000. The curse of dimensionality. In5th online world conference on soft computing in industrial applications (WSC5), Vol. 1. 4–8

  24. [24]

    Kutner, Christopher J

    Michael H. Kutner, Christopher J. Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models(5th ed.). McGraw-Hill/Irwin, Boston, MA

  25. [25]

    2020.Bandit Algorithms

    Tor Lattimore and Csaba Szepesvári. 2020.Bandit Algorithms. Cambridge Univer- sity Press, Cambridge, UK. https://tor-lattimore.com/downloads/book/book.pdf

  26. [26]

    Myungho Lee, Sooyoung Cha, and Hakjoo Oh. 2023. Learning seed-adaptive mutation strategies for greybox fuzzing. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 384–396

  27. [27]

    Caroline Lemieux and Koushik Sen. 2018. Fairfuzz: A targeted mutation strategy for increasing greybox fuzz testing coverage. InProceedings of the 33rd ACM/IEEE international conference on automated software engineering. 475–485

  28. [28]

    Yuekang Li, Yinxing Xue, Hongxu Chen, Xiuheng Wu, Cen Zhang, Xiaofei Xie, Haijun Wang, and Yang Liu. 2019. Cerebro: context-aware adaptive fuzzing for effective vulnerability detection. InProceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 533–544

  29. [29]

    Jie Liang, Mingzhe Wang, Chijin Zhou, Zhiyong Wu, Jianzhong Liu, and Yu Jiang

  30. [30]

    InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering

    Dodrio: Parallelizing Taint Analysis Based Fuzzing via Redundancy-Free Scheduling. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 244–254

  31. [31]

    Chenyang Lyu, Shouling Ji, Chao Zhang, Yuwei Li, Wei-Han Lee, Yu Song, and Raheem Beyah. 2019. MOPT: Optimized Mutation Scheduling for Fuzzers. In28th USENIX Security Symposium (USENIX Security 19). USENIX Association, Santa Clara, CA, 1949–1966. https://www.usenix.org/conference/usenixsecurity19/ presentation/lyu

  32. [32]

    Mann and Donald R

    Henry B. Mann and Donald R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other.The Annals of Mathematical Statistics18, 1 (1947), 50–60. https://doi.org/10.1214/aoms/ 1177730491

  33. [33]

    Jonathan Metzman, László Szekeres, Laurent Simon, Read Sprabery, and Abhishek Arya. 2021. Fuzzbench: an open fuzzer benchmarking platform and service. In Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 1393–1403

  34. [34]

    Maria-Irina Nicolae, Max Eisele, and Andreas Zeller. 2023. Revisiting neural program smoothing for fuzzing. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 133–145

  35. [35]

    J. R. Norris. 1998.Markov Chains. Cambridge University Press, Cambridge, UK

  36. [36]

    Pennsylvania State University. 2025. ANOVA Assumptions. https://online.stat. psu.edu/stat500/lesson/10/10.2/10.2.1 Accessed: 2025-03-05

  37. [37]

    Ruixiang Qian, Quanjun Zhang, Chunrong Fang, Ding Yang, Shun Li, Binyu Li, and Zhenyu Chen. 2024. DiPri: Distance-based Seed Prioritization for Greybox Fuzzing.ACM Transactions on Software Engineering and Methodology(2024)

  38. [38]

    Sanjay Rawat, Vivek Jain, Ashish Kumar, Lucian Cojocar, Cristiano Giuffrida, and Herbert Bos. 2017. VUzzer: Application-aware evolutionary fuzzing.. InNDSS, Vol. 17. 1–14

  39. [39]

    Moritz Schloegel, Nils Bars, Nico Schiller, Lukas Bernhard, Tobias Scharnowski, Addison Crump, Arash Ale Ebrahim, Nicolai Bissantz, Marius Muench, and Thorsten Holz. 2024. SoK: Prudent Evaluation Practices for Fuzzing.arXiv preprint arXiv:2405.10220(2024)

  40. [40]

    Skipper Seabold and Josef Perktold. 2010. statsmodels: Econometric and statis- tical modeling with python. In9th Python in Science Conference. https://www. statsmodels.org/stable/generated/statsmodels.stats.anova.anova_lm.html

  41. [41]

    Dongdong She, Rahul Krishna, Lu Yan, Suman Jana, and Baishakhi Ray. 2020. MTFuzz: fuzzing with a multi-task neural network. InProceedings of the 28th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 737–749

  42. [42]

    Dongdong She, Kexin Pei, Dave Epstein, Junfeng Yang, Baishakhi Ray, and Suman Jana. 2019. Neuzz: Efficient fuzzing with neural program smoothing. In2019 IEEE Symposium on Security and Privacy (SP). IEEE, 803–817

  43. [43]

    Robert Swiecki. 2015. Honggfuzz: A general-purpose, easy-to-use fuzzer with simple, command-line interface. https://github.com/google/honggfuzz. Accessed: 2024-08-25

  44. [44]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971(2023)

  45. [45]

    András Vargha and Harold D Delaney. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong.Journal of Educational and Behavioral Statistics25, 2 (2000), 101–132

  46. [46]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. InProceedings of the 31st International Conference on Neural In- formation Processing Systems (NeurIPS). Curran Associates, Inc., 6000–6010. https://arxiv.org/abs/1706.03762

  47. [47]

    Xiaohong Wang, Chengcheng Hu, Ruopeng Ma, et al. 2021. CMFuzz: context- aware adaptive mutation for fuzzers.Empirical Software Engineering26, 10 (2021), 1–28. https://doi.org/10.1007/s10664-020-09927-3

  48. [48]

    Mingyuan Wu, Ling Jiang, Jiahong Xiang, Yanwei Huang, Heming Cui, Lingming Zhang, and Yuqun Zhang. 2022. One fuzzing strategy to rule them all. InPro- ceedings of the 44th International Conference on Software Engineering. 1634–1645

  49. [49]

    Michal Zalewski. 2014. American Fuzzy Lop (AFL). https://github.com/google/ AFL. Accessed: 2024-08-25

  50. [50]

    Andreas Zeller, Rahul Gopinath, Marcel Böhme, Christian Holler, Caroline Lemieux, and Zhendong Su. 2024. Greybox Fuzzing. InThe Fuzzing Book, An- dreas Zeller, Rahul Gopinath, Marcel Böhme, Christian Holler, Caroline Lemieux, and Zhendong Su (Eds.). Fuzzingbook.org. https://www.fuzzingbook.org/html/ GreyboxFuzzer.html

  51. [51]

    Han Zheng, Flavio Toffalini, Marcel Böhme, and Mathias Payer. 2025. MendelFuzz: The Return of the Deterministic Stage.Proceedings of the ACM on Software Engineering2, FSE (2025), 44–64