
arxiv: 2605.04760 · v1 · submitted 2026-05-06 · 💻 cs.CR · cs.NI · cs.SE

AFL-ICP: Enhancing Industrial Control Protocol Reliability via Specification-Guided Fuzzing

Pith reviewed 2026-05-08 17:43 UTC · model grok-4.3

classification 💻 cs.CR · cs.NI · cs.SE
keywords industrial control protocols · specification-guided fuzzing · LLM-assisted testing · semantic bug detection · protocol security · differential checking · AFL · vulnerability discovery

The pith

AFL-ICP formalizes industrial control protocol specifications into executable grammars and uses LLMs to guide fuzzing and detect semantic bugs missed by existing tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that observation-driven fuzzers cannot reach deep states or spot subtle logic errors in industrial control protocols because they lack direct access to formal specifications. AFL-ICP instead builds a pipeline that converts specifications into machine-executable grammars, then applies LLMs for protocol adaptation, seed creation, and differential checking of implementation behavior against the spec. This setup is meant to expose semantic deviations that silently affect availability and operations. A sympathetic reader would care because these protocols run critical infrastructure where undetected bugs carry real safety and reliability risks. The authors report higher coverage on four tested protocols and 24 previously unknown vulnerabilities, 16 of them semantic or logic bugs.

Core claim

AFL-ICP pioneers a specification-driven paradigm for fuzzing industrial control protocols. It features a context-aware specification formalization pipeline to transform complex specifications into rigorous machine-executable grammars. Building on this formalized specification, AFL-ICP leverages LLMs to enable automated protocol adaptation and seed generation, allowing for rapid extension to new protocols with minimal manual effort. Additionally, it includes an LLM-powered differential checker that cross-references implementation outputs with specification requirements to detect subtle semantic and logic bugs that existing fuzzers cannot detect.
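
To make "machine-executable grammar" and "seed generation" concrete, here is a minimal Python sketch of how seeds might be derived from a formalized message layout. The toy layout, its field names (`start`, `function`, `address`), the permitted values, and the `FieldSpec` / `build_seed` / `generate_seeds` helpers are all hypothetical illustrations; they are not the grammar format, the pipeline, or any protocol from the paper.

```python
import struct
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FieldSpec:
    name: str
    fmt: str      # struct format code, e.g. "B" = unsigned byte, "H" = unsigned 16-bit
    valid: tuple  # values the (hypothetical) specification permits for this field


# Hypothetical toy grammar: start byte, function code, 16-bit register address.
GRAMMAR = (
    FieldSpec("start", "B", (0x68,)),
    FieldSpec("function", "B", (0x01, 0x03)),
    FieldSpec("address", "H", (0x0000, 0xFFFF)),
)


def build_seed(values):
    """Pack one value per grammar field into a big-endian byte string."""
    fmt = ">" + "".join(field.fmt for field in GRAMMAR)
    return struct.pack(fmt, *values)


def generate_seeds():
    """Enumerate every combination of spec-permitted field values."""
    return [build_seed(combo) for combo in product(*(f.valid for f in GRAMMAR))]


seeds = generate_seeds()
```

Every seed produced this way is valid by construction, which is the property a specification-driven fuzzer relies on to get past early parsing checks before mutation begins.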

What carries the argument

The context-aware specification formalization pipeline that produces machine-executable grammars from protocol specifications, paired with an LLM-powered differential checker that compares run-time outputs to those grammars to flag semantic deviations.
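
The differential-checking idea, comparing what an implementation returned against what the specification requires, can be sketched without an LLM. In the minimal Python example below, the rule table, the function codes, and the `check_response` helper are hypothetical stand-ins; the paper's checker uses an LLM rather than a hand-written rule table, so this only illustrates the kind of deviation being looked for.

```python
from typing import Optional

# Hypothetical spec rules: request function code -> required response function code.
SPEC_RULES = {
    0x01: 0x81,  # assumed: a read request must be answered with a read confirmation
    0x03: 0x83,  # assumed: a write request must be answered with a write confirmation
}


def check_response(request: bytes, response: Optional[bytes]) -> Optional[str]:
    """Return a description of the deviation, or None if the response conforms."""
    expected = SPEC_RULES.get(request[0])
    if expected is None:
        # Assumed rule: requests the toy spec does not define must get no reply.
        return None if not response else "reply to undefined request"
    if not response:
        return "missing reply"
    if response[0] != expected:
        return f"expected 0x{expected:02x}, got 0x{response[0]:02x}"
    return None
```

None of the deviations flagged here involve a crash, which is why a crash-oriented fuzzer would pass over them.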

If this is right

  • The LLM automation allows new protocols to be supported with far less manual grammar writing than traditional fuzzers require.
  • The differential checker finds 16 semantic and logic bugs that can disrupt industrial operations without obvious crashes.
  • Overall coverage exceeds that of state-of-the-art observation-driven fuzzers on both open- and closed-source ICP implementations.
  • Vendor acknowledgments for the 24 vulnerabilities indicate the bugs are reproducible enough to warrant fixes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same formalization-plus-differential-check approach could be applied to other specification-rich domains such as automotive bus protocols or medical device interfaces.
  • Generated grammars could serve as the basis for model-based test generation beyond pure fuzzing, for example to prove absence of certain state transitions.
  • If the LLM formalization step proves stable across model versions, the framework could lower the barrier for smaller vendors to perform rigorous protocol testing.

Load-bearing premise

The LLM components produce accurate grammars and trustworthy bug reports without systematic false positives or missed vulnerabilities that would erase the reported coverage gains.

What would settle it

An independent team re-runs AFL-ICP on the same four protocols, manually verifies the 24 reported vulnerabilities, and measures whether coverage and bug counts remain higher than baseline fuzzers when the LLM formalization or checker is disabled.

Figures

Figures reproduced from arXiv: 2605.04760 by Jiaying Meng, Ke Xu, Min Liu, Qi Li, Xuewei Feng.

Figure 1: Comparison between the traditional fuzzing and the …
Figure 2: Architecture of AFL-ICP: end-to-end workflow from multimodal specification formalization to LLM-assisted vulnerability reasoning.
Figure 3: The Multimodal Layout Reconstruction Pipeline.
Figure 4: Ablation study results showing state, transition, and line coverage improvements.
Figure 5: Ablation study results showing branch coverage improvements.

Original abstract

Industrial Control Protocols (ICPs) are critical to the reliability and stability of industrial infrastructure, yet their security is fundamentally compromised by a specification-blindness bottleneck. Modern fuzzers, constrained by observation-driven inference, struggle to penetrate deep protocol states or detect subtle semantic deviations. In this paper, we present AFL-ICP, an autonomous fuzzing framework that pioneers a specification-driven paradigm. AFL-ICP features a context-aware specification formalization pipeline to transform complex specifications into rigorous machine-executable grammars. Building on this formalized specification, AFL-ICP leverages LLMs to enable automated protocol adaptation and seed generation, allowing for rapid extension to new protocols with minimal manual effort. Additionally, it includes an LLM-powered differential checker that cross-references implementation outputs with specification requirements to detect subtle semantic and logic bugs that existing fuzzers cannot detect. We implement AFL-ICP and evaluate it on four widely used ICPs, including both open-source and closed-source variants. Results show that AFL-ICP significantly outperforms state-of-the-art fuzzers in coverage and uncovers 24 previously unknown vulnerabilities, for which we have received acknowledgments from affected vendors (e.g., FreyrSCADA). Specifically, the identified vulnerabilities include 16 semantic and logic bugs that can silently disrupt industrial operations and degrade service availability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents AFL-ICP, a specification-guided fuzzing framework for Industrial Control Protocols that uses a context-aware LLM pipeline to convert complex specs into executable grammars, automates protocol adaptation and seed generation, and deploys an LLM-based differential checker to identify semantic and logic bugs beyond what observation-driven fuzzers can reach. Evaluation on four ICPs (open- and closed-source) claims substantially higher coverage than state-of-the-art fuzzers plus discovery of 24 previously unknown vulnerabilities (16 semantic/logic), supported by vendor acknowledgments such as from FreyrSCADA.

Significance. If the central performance and bug-finding claims are substantiated, the work would meaningfully advance automated security analysis of critical infrastructure protocols by addressing the specification-blindness limitation of existing fuzzers. The combination of LLM-driven formalization, low-manual-effort adaptation, and semantic deviation detection targets a practically important gap; reproducible artifacts or quantified validation of the LLM components would further strengthen its contribution.

major comments (3)
  1. [Abstract and Evaluation] Abstract and Evaluation section: the reported coverage gains and 24-vulnerability count rest on the LLM formalization and differential checker, yet no accuracy metrics (grammar fidelity vs. manual spec, checker precision/recall), ablation studies, or false-positive/negative rates are supplied, leaving the performance delta and semantic-bug validity difficult to assess.
  2. [Evaluation] Evaluation section: concrete results are stated for four protocols, but baseline fuzzer configurations, coverage measurement methodology (e.g., which states/messages counted), seed-selection strategy, and run-time parameters are not detailed, preventing reproduction or attribution of gains specifically to the specification-guided components.
  3. [Differential Checker] LLM-powered differential checker description: the claim that it reliably distinguishes true semantic/logic deviations from noise is load-bearing for the 16 semantic bugs reported; without an independent validation set or quantified error analysis, the bug counts risk inflation from checker hallucinations or over-sensitivity.
minor comments (1)
  1. [Methodology] The architecture diagram and pipeline description would benefit from explicit call-outs of which LLM prompts or post-processing steps enforce grammar constraints, to clarify how hallucinations are mitigated.
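
One way to picture the post-processing that the minor comment asks about is a structural validator that rejects an LLM-produced message layout before it reaches the fuzzer. The Python sketch below is an assumption about what such a check could look like; the layout format (dictionaries with `name`, `offset`, `size`) and the `validate_layout` helper are hypothetical and do not come from the paper.

```python
def validate_layout(fields, total_size):
    """Return a list of problems found in a field layout; an empty list means it passed."""
    problems = []
    cursor = 0  # first byte offset not yet covered by any field
    for field in sorted(fields, key=lambda f: f["offset"]):
        if field["size"] <= 0:
            problems.append(f"{field['name']}: non-positive size")
        if field["offset"] < cursor:
            problems.append(f"{field['name']}: overlaps previous field")
        elif field["offset"] > cursor:
            problems.append(f"{field['name']}: gap before field")
        cursor = max(cursor, field["offset"] + field["size"])
    if cursor != total_size:
        problems.append(f"layout covers {cursor} bytes, expected {total_size}")
    return problems
```

A deterministic check of this kind cannot confirm that a grammar matches the specification's meaning, but it can catch layouts that are internally inconsistent regardless of which model produced them.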

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important areas where additional details and validation can strengthen the presentation of our results. We address each major comment below and will incorporate the suggested improvements in a revised manuscript.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: the reported coverage gains and 24-vulnerability count rest on the LLM formalization and differential checker, yet no accuracy metrics (grammar fidelity vs. manual spec, checker precision/recall), ablation studies, or false-positive/negative rates are supplied, leaving the performance delta and semantic-bug validity difficult to assess.

    Authors: We agree that explicit accuracy metrics and ablation studies would improve the ability to assess the contributions of the LLM components. In the revised manuscript we will add a new subsection in the Evaluation section that reports (1) grammar fidelity measured by comparing a sample of LLM-generated grammars against manually authored reference grammars, (2) precision/recall/F1 of the differential checker on a held-out validation set of labeled deviations, and (3) ablation results that isolate the impact of the specification formalization pipeline and the differential checker on both coverage and bug discovery. These additions will be supported by data we have already collected from our experimental runs.

    Revision: yes

  2. Referee: [Evaluation] Evaluation section: concrete results are stated for four protocols, but baseline fuzzer configurations, coverage measurement methodology (e.g., which states/messages counted), seed-selection strategy, and run-time parameters are not detailed, preventing reproduction or attribution of gains specifically to the specification-guided components.

    Authors: We acknowledge that the current Evaluation section lacks sufficient implementation-level detail for full reproducibility. In the revision we will expand this section to specify: the exact configurations and command-line parameters used for each baseline fuzzer, the coverage instrumentation and measurement method (branch coverage via LLVM and the precise definition of protocol states/messages counted), the seed-selection and mutation strategy (initial seeds derived directly from the formalized grammar with specification-constrained mutations), and all run-time parameters (duration, hardware platform, random-seed settings). These details will make it possible to attribute performance differences to the specification-guided elements and to reproduce the experiments.

    Revision: yes

  3. Referee: [Differential Checker] LLM-powered differential checker description: the claim that it reliably distinguishes true semantic/logic deviations from noise is load-bearing for the 16 semantic bugs reported; without an independent validation set or quantified error analysis, the bug counts risk inflation from checker hallucinations or over-sensitivity.

    Authors: We recognize that the reliability of the differential checker is central to the validity of the 16 semantic and logic bugs. In the revised version we will add an independent validation set of 50 manually labeled deviation cases (balanced between true semantic issues and non-buggy outputs) and report precision, recall, and F1-score for the checker. We will also describe our mitigation approach (multiple LLM queries with majority voting plus targeted human review of flagged cases) and discuss the observed error modes. This analysis will be placed in the Differential Checker subsection and will provide quantitative support for the reported bug counts.

    Revision: yes
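
The mitigation and metrics named in the third response (majority voting over repeated checker queries, then precision, recall, and F1 against labeled cases) are standard and can be sketched in a few lines of Python. The verdict lists and labels below are made-up placeholders chosen only to exercise the code; they are not results from the paper, and `majority_vote` and `precision_recall_f1` are illustrative helper names.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return True if strictly more than half of the verdicts flag a bug."""
    return Counter(verdicts)[True] * 2 > len(verdicts)


def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 for boolean predictions against labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))      # true positives
    fp = sum(p and not a for p, a in zip(predicted, actual))  # false positives
    fn = sum(a and not p for p, a in zip(predicted, actual))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Placeholder data: three checker queries per case, plus one human label per case.
queries = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [True, False, True],
]
labels = [True, False, True, False]
predicted = [majority_vote(q) for q in queries]
precision, recall, f1 = precision_recall_f1(predicted, labels)
```

Voting reduces run-to-run variance in the checker's verdicts, but it does not correct a systematic misreading of the specification, which is why the labeled validation set is still needed.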

Circularity Check

0 steps flagged

No circularity in empirical tool evaluation

full rationale

The paper presents AFL-ICP as an empirical fuzzing framework relying on LLM-based specification formalization, seed generation, and differential checking. All central claims (superior coverage, 24 new vulnerabilities) are supported by direct experimental measurements on four ICPs against SOTA baselines, with external vendor acknowledgments. No equations, predictions, or derivations exist that reduce results to fitted inputs, self-definitions, or self-citation chains. The evaluation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on the assumption that LLMs can faithfully translate natural-language specifications into executable grammars and that differential outputs reliably indicate semantic bugs; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5538 in / 1150 out tokens · 37990 ms · 2026-05-08T17:43:59.767344+00:00 · methodology


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

  1. S. D. D. Anton, D. Fraunholz, D. Krohmer, D. Reti, D. Schneider, and H. D. Schotten, “The global state of security in industrial control systems: An empirical analysis of vulnerabilities around the world,” IoTJ, 2021
  2. V.-T. Pham, M. Böhme, and A. Roychoudhury, “Aflnet: a greybox fuzzer for network protocols,” in ICST, 2020
  3. R. Natella, “Stateafl: Greybox fuzzing for stateful network servers,” Empirical Software Engineering, 2022
  4. R. Meng, M. Mirchev, M. Böhme, and A. Roychoudhury, “Large language model guided protocol fuzzing,” in NDSS, 2024
  5. G. Klees, A. Ruef, B. Cooper, S. Wei, and M. Hicks, “Evaluating fuzz testing,” in SIGSAC, 2018
  6. A. Herrera, H. Gunadi, S. Magrath, M. Norrish, M. Payer, and A. L. Hosking, “Seed selection for successful fuzzing,” in ISSTA, 2021
  7. T. C. Projects, “Undefined behavior sanitizer for chromium,” Website. http://www.chromium.org/developers/testing/undefinedbehaviorsanitizer, 2014
  8. K. Serebryany, D. Bruening, A. Potapenko, and D. Vyukov, “AddressSanitizer: A fast address sanity checker,” in USENIX ATC, 2012
  9. D. Song, J. Lettner, P. Rajasekaran, Y. Na, S. Volckaert, P. Larsen, and M. Franz, “Sok: Sanitizing for security,” in S&P, 2019
  10. Mistral AI, “Mistral OCR,” https://mistral.ai, 2025, accessed: 2025-03-12
  11. OpenAI, “Hello GPT-4o,” 2024, accessed: 2025-03-12. [Online]. Available: https://openai.com/index/hello-gpt-4o/
  12. A. R. Aleksy Timin, “Eipscanner,” Website. https://github.com/nimbuscontrols/EIPScanner
  13. Z. Luo, F. Zuo, Y. Jiang, J. Gao, X. Jiao, and J. Sun, “Polar: Function code aware fuzz testing of ics protocol,” TECS, 2019
  14. Z. Luo, F. Zuo, Y. Shen, X. Jiao, W. Chang, and Y. Jiang, “Ics protocol fuzzing: Coverage guided packet crack and generation,” in DAC, 2020
  15. F. Zuo, Z. Luo, J. Yu, T. Chen, Z. Xu, A. Cui, and Y. Jiang, “Vulnerability detection of ics protocols via cross-state fuzzing,” TCAD, 2022
  16. Gcovr Developers, “Gcovr: A report generator for gcc’s gcov,” https://gcovr.com/en/stable/, version 8.4, accessed September 2025
  17. F. E. Solution, “Iec-60870-5-104,” Website. https://github.com/FreyrSCADA/IEC-60870-5-104
  18. Y. Sun, S. Lv, J. You, Y. Sun, X. Chen, Y. Zheng, and L. Sun, “Ipspex: Enabling efficient fuzzing via specification extraction on ics protocol,” in ACANC, 2022
  19. Y.-H. Zou, J.-J. Bai, J. Zhou, J. Tan, C. Qin, and S.-M. Hu, “TCP-Fuzz: Detecting memory and semantic bugs in TCP stacks with fuzzing,” in USENIX ATC 21, 2021
  20. D. Wang, G. Zhou, L. Chen, D. Li, and Y. Miao, “Prophetfuzz: Fully automated prediction and fuzzing of high-risk option combinations with only documentation via large language model,” in SIGSAC, 2024
  21. M. M. Rahman, I. Karim, and E. Bertino, “Cellularlint: A systematic approach to identify inconsistent behavior in cellular network specifications,” in USENIX Security, 2024
  22. C. S. Xia, M. Paltenghi, J. Le Tian, M. Pradel, and L. Zhang, “Fuzz4all: Universal fuzzing with large language models,” in ICSE, 2024
  23. C. Yang, Z. Zhao, and L. Zhang, “Kernelgpt: Enhanced kernel fuzzing via large language models,” in ASPLOS, 2025

Table IV: Summary of memory safety vulnerabilities discovered by AFL-ICP. ID 1 · Subject: FreyrSCADA · Version: V21.06.008-89-g917706d · Memory safety bug description: Memory leaks after memory ...