AFL-ICP: Enhancing Industrial Control Protocol Reliability via Specification-Guided Fuzzing
Pith reviewed 2026-05-08 17:43 UTC · model grok-4.3
The pith
AFL-ICP formalizes industrial control protocol specifications into executable grammars and uses LLMs to guide fuzzing and detect semantic bugs missed by existing tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AFL-ICP pioneers a specification-driven paradigm for fuzzing industrial control protocols. It features a context-aware specification formalization pipeline to transform complex specifications into rigorous machine-executable grammars. Building on this formalized specification, AFL-ICP leverages LLMs to enable automated protocol adaptation and seed generation, allowing for rapid extension to new protocols with minimal manual effort. Additionally, it includes an LLM-powered differential checker that cross-references implementation outputs with specification requirements to detect subtle semantic and logic bugs that existing fuzzers cannot detect.
What carries the argument
The context-aware specification formalization pipeline that produces machine-executable grammars from protocol specifications, paired with an LLM-powered differential checker that compares run-time outputs to those grammars to flag semantic deviations.
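To make the idea of a machine-executable grammar concrete, here is a minimal sketch. It assumes a toy three-field frame layout invented for illustration; the field names, byte values, and helper functions are hypothetical and do not come from the paper or from any real protocol. The paper performs the conformance check with an LLM, whereas this sketch substitutes a plain deterministic rule check.

```python
import random
import struct

# Hypothetical toy grammar (an assumption, not from the paper):
# each entry is (field name, struct format, allowed values).
GRAMMAR = [
    ("start", "B", [0x68]),             # fixed start byte
    ("function", "B", [0x01, 0x02]),    # allowed function codes
    ("address", ">H", range(0, 1024)),  # 16-bit big-endian address
]

def generate_seed(rng):
    """Build one valid frame by picking an allowed value for every field."""
    return b"".join(
        struct.pack(fmt, rng.choice(list(allowed)))
        for _, fmt, allowed in GRAMMAR
    )

def parse(frame):
    """Decode a frame into a dict of field values, in grammar order."""
    fields, offset = {}, 0
    for name, fmt, _ in GRAMMAR:
        (fields[name],) = struct.unpack_from(fmt, frame, offset)
        offset += struct.calcsize(fmt)
    return fields

def violations(frame):
    """Return the names of fields whose value falls outside the grammar."""
    fields = parse(frame)
    return [name for name, _, allowed in GRAMMAR if fields[name] not in allowed]

seed = generate_seed(random.Random(0))
print(seed.hex(), violations(seed))
```

The same table drives both seed generation and output checking, which is the property the review attributes to the formalized specification: one artifact, two uses.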
If this is right
- The LLM automation allows new protocols to be supported with far less manual grammar writing than traditional fuzzers require.
- The differential checker finds 16 semantic and logic bugs that can disrupt industrial operations without obvious crashes.
- Overall coverage exceeds that of state-of-the-art observation-driven fuzzers on both open- and closed-source ICP implementations.
- Vendor acknowledgments for the 24 vulnerabilities indicate the bugs are reproducible enough to warrant fixes.
Where Pith is reading between the lines
- The same formalization-plus-differential-check approach could be applied to other specification-rich domains such as automotive bus protocols or medical device interfaces.
- Generated grammars could serve as the basis for model-based test generation beyond pure fuzzing, for example to prove absence of certain state transitions.
- If the LLM formalization step proves stable across model versions, the framework could lower the barrier for smaller vendors to perform rigorous protocol testing.
Load-bearing premise
The LLM components produce accurate grammars and trustworthy bug reports without systematic false positives or missed vulnerabilities that would erase the reported coverage gains.
What would settle it
An independent team re-runs AFL-ICP on the same four protocols, manually verifies the 24 reported vulnerabilities, and measures whether coverage and bug counts remain higher than baseline fuzzers when the LLM formalization or checker is disabled.
Original abstract
Industrial Control Protocols (ICPs) are critical to the reliability and stability of industrial infrastructure, yet their security is fundamentally compromised by a specification-blindness bottleneck. Modern fuzzers, constrained by observation-driven inference, struggle to penetrate deep protocol states or detect subtle semantic deviations. In this paper, we present AFL-ICP, an autonomous fuzzing framework that pioneers a specification-driven paradigm. AFL-ICP features a context-aware specification formalization pipeline to transform complex specifications into rigorous machine-executable grammars. Building on this formalized specification, AFL-ICP leverages LLMs to enable automated protocol adaptation and seed generation, allowing for rapid extension to new protocols with minimal manual effort. Additionally, it includes an LLM-powered differential checker that cross-references implementation outputs with specification requirements to detect subtle semantic and logic bugs that existing fuzzers cannot detect. We implement AFL-ICP and evaluate it on four widely used ICPs, including both open-source and closed-source variants. Results show that AFL-ICP significantly outperforms state-of-the-art fuzzers in coverage and uncovers 24 previously unknown vulnerabilities, for which we have received acknowledgments from affected vendors (e.g., FreyrSCADA). Specifically, the identified vulnerabilities include 16 semantic and logic bugs that can silently disrupt industrial operations and degrade service availability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents AFL-ICP, a specification-guided fuzzing framework for Industrial Control Protocols that uses a context-aware LLM pipeline to convert complex specs into executable grammars, automates protocol adaptation and seed generation, and deploys an LLM-based differential checker to identify semantic and logic bugs beyond what observation-driven fuzzers can reach. Evaluation on four ICPs (open- and closed-source) claims substantially higher coverage than state-of-the-art fuzzers plus discovery of 24 previously unknown vulnerabilities (16 semantic/logic), supported by vendor acknowledgments such as from FreyrSCADA.
Significance. If the central performance and bug-finding claims are substantiated, the work would meaningfully advance automated security analysis of critical infrastructure protocols by addressing the specification-blindness limitation of existing fuzzers. The combination of LLM-driven formalization, low-manual-effort adaptation, and semantic deviation detection targets a practically important gap; reproducible artifacts or quantified validation of the LLM components would further strengthen its contribution.
major comments (3)
- [Abstract and Evaluation] Abstract and Evaluation section: the reported coverage gains and 24-vulnerability count rest on the LLM formalization and differential checker, yet no accuracy metrics (grammar fidelity vs. manual spec, checker precision/recall), ablation studies, or false-positive/negative rates are supplied, leaving the performance delta and semantic-bug validity difficult to assess.
- [Evaluation] Evaluation section: concrete results are stated for four protocols, but baseline fuzzer configurations, coverage measurement methodology (e.g., which states/messages counted), seed-selection strategy, and run-time parameters are not detailed, preventing reproduction or attribution of gains specifically to the specification-guided components.
- [Differential Checker] LLM-powered differential checker description: the claim that it reliably distinguishes true semantic/logic deviations from noise is load-bearing for the 16 semantic bugs reported; without an independent validation set or quantified error analysis, the bug counts risk inflation from checker hallucinations or over-sensitivity.
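The checker metrics requested in the major comments can be computed along these lines. This is a minimal sketch that assumes each case in a labeled set carries a boolean ground-truth label and a boolean checker verdict; the function name and the input format are illustrative assumptions, not part of the paper.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 for a binary checker.

    predicted / actual are parallel lists of booleans, where True means
    "flagged as a deviation" and "really a deviation" respectively.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)      # correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)  # false alarms
    fn = sum(1 for p, a in pairs if a and not p)  # missed deviations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

print(precision_recall_f1([True, True, False, False],
                          [True, False, True, False]))
```

Reporting all three numbers matters here because a checker tuned only for recall could inflate the semantic-bug count with false alarms, which is the risk the third major comment raises.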
minor comments (1)
- [Methodology] The architecture diagram and pipeline description would benefit from explicit call-outs of which LLM prompts or post-processing steps enforce grammar constraints, to clarify how hallucinations are mitigated.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The comments highlight important areas where additional details and validation can strengthen the presentation of our results. We address each major comment below and will incorporate the suggested improvements in a revised manuscript.
- Referee: [Abstract and Evaluation] Abstract and Evaluation section: the reported coverage gains and 24-vulnerability count rest on the LLM formalization and differential checker, yet no accuracy metrics (grammar fidelity vs. manual spec, checker precision/recall), ablation studies, or false-positive/negative rates are supplied, leaving the performance delta and semantic-bug validity difficult to assess.
  Authors: We agree that explicit accuracy metrics and ablation studies would improve the ability to assess the contributions of the LLM components. In the revised manuscript we will add a new subsection in the Evaluation section that reports (1) grammar fidelity measured by comparing a sample of LLM-generated grammars against manually authored reference grammars, (2) precision/recall/F1 of the differential checker on a held-out validation set of labeled deviations, and (3) ablation results that isolate the impact of the specification formalization pipeline and the differential checker on both coverage and bug discovery. These additions will be supported by data we have already collected from our experimental runs.
  Revision: yes
- Referee: [Evaluation] Evaluation section: concrete results are stated for four protocols, but baseline fuzzer configurations, coverage measurement methodology (e.g., which states/messages counted), seed-selection strategy, and run-time parameters are not detailed, preventing reproduction or attribution of gains specifically to the specification-guided components.
  Authors: We acknowledge that the current Evaluation section lacks sufficient implementation-level detail for full reproducibility. In the revision we will expand this section to specify: the exact configurations and command-line parameters used for each baseline fuzzer, the coverage instrumentation and measurement method (branch coverage via LLVM and the precise definition of protocol states/messages counted), the seed-selection and mutation strategy (initial seeds derived directly from the formalized grammar with specification-constrained mutations), and all run-time parameters (duration, hardware platform, random-seed settings). These details will make it possible to attribute performance differences to the specification-guided elements and to reproduce the experiments.
  Revision: yes
- Referee: [Differential Checker] LLM-powered differential checker description: the claim that it reliably distinguishes true semantic/logic deviations from noise is load-bearing for the 16 semantic bugs reported; without an independent validation set or quantified error analysis, the bug counts risk inflation from checker hallucinations or over-sensitivity.
  Authors: We recognize that the reliability of the differential checker is central to the validity of the 16 semantic and logic bugs. In the revised version we will add an independent validation set of 50 manually labeled deviation cases (balanced between true semantic issues and non-buggy outputs) and report precision, recall, and F1-score for the checker. We will also describe our mitigation approach (multiple LLM queries with majority voting plus targeted human review of flagged cases) and discuss the observed error modes. This analysis will be placed in the Differential Checker subsection and will provide quantitative support for the reported bug counts.
  Revision: yes
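The mitigation named in the last response, multiple LLM queries combined by majority voting with human review of unclear cases, can be sketched as follows. The verdict labels, the `min_share` threshold, and the `needs_review` fallback are assumptions made for illustration; the rebuttal does not state how many queries are issued or how ties are resolved.

```python
from collections import Counter

def majority_verdict(verdicts, min_share=0.5):
    """Combine repeated checker verdicts for one test case.

    Returns the most common label if it holds strictly more than
    min_share of the votes; otherwise returns "needs_review" so that
    a human can look at the case.
    """
    if not verdicts:
        return "needs_review"
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count / len(verdicts) > min_share else "needs_review"

print(majority_verdict(["bug", "bug", "ok"]))
print(majority_verdict(["bug", "ok"]))
```

Routing ties and weak majorities to review rather than picking a side keeps an unstable LLM answer from silently becoming a reported bug.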
Circularity Check
No circularity in empirical tool evaluation
The paper presents AFL-ICP as an empirical fuzzing framework relying on LLM-based specification formalization, seed generation, and differential checking. All central claims (superior coverage, 24 new vulnerabilities) are supported by direct experimental measurements on four ICPs against SOTA baselines, with external vendor acknowledgments. No equations, predictions, or derivations exist that reduce results to fitted inputs, self-definitions, or self-citation chains. The evaluation is self-contained against external benchmarks.