arxiv: 2606.23464 · v1 · pith:2PPNIHBRnew · submitted 2026-06-22 · 💻 cs.SE · cs.CR

An Automated Framework for Input Alphabet Construction in Stateful Protocol Implementation Learning

Pith reviewed 2026-06-26 07:21 UTC · model grok-4.3

classification 💻 cs.SE cs.CR
keywords state machine learning · input alphabet construction · protocol implementations · large language models · semantic bugs · incremental learning · automated testing

The pith

Large language models can automatically construct input alphabets for state machine learning of stateful protocols.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to remove the need for manually defined input alphabets in state machine learning of protocol implementations. It proposes using large language models to parse message structures and generate mutated input symbols that cover both valid and invalid cases. This automation lets the learner explore a broader input space and detect semantic defects that handcrafted alphabets miss. A mini-batch incremental learning strategy contains the extra cost of larger alphabets by reusing automata that have already been learned. Experiments on real protocol stacks reproduce known vulnerabilities and uncover new semantic bugs, some of which developers have confirmed and patched.

Core claim

The central claim is that an automated framework, in which large language models parse protocol message layouts and generate candidate input symbols through structured mutation rules, can break the bottleneck of handcrafted alphabets in stateful protocol learning. A mini-batch incremental learning strategy controls the resulting overhead, and the combined framework reproduces existing security vulnerabilities and identifies novel semantic bugs in practical implementations.

What carries the argument

LLM-based generation of input symbols using structured mutation rules on parsed layouts, which automatically covers valid and invalid message spaces.
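
To make the idea of structured mutation rules concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Field` type, the `RULES` table, and the `build_symbols` helper are hypothetical, and the hand-written layout stands in for whatever a language model would parse out of a protocol specification. Only the symbol naming pattern (for example `USER_Payload_null`) follows the FTP examples visible in the paper's figure excerpts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    value: str
    mutable: bool = True


# Hypothetical mutation rules: each maps a field's valid value to a mutated one.
RULES = {
    "null": lambda value: "",
    "replace": lambda value: "A" * 64,
}


def build_symbols(message_type: str, layout: list[Field]) -> dict[str, str]:
    """Map symbol names to concrete message text for one message type."""
    # The unmutated message is always kept as the valid symbol.
    symbols = {message_type: " ".join(f.value for f in layout)}
    for index, field in enumerate(layout):
        if not field.mutable:
            continue
        for rule_name, rule in RULES.items():
            parts = [f.value for f in layout]
            parts[index] = rule(field.value)
            name = f"{message_type}_{field.name}_{rule_name}"
            symbols[name] = " ".join(parts).strip()
    return symbols


# Hand-written stand-in for an LLM-parsed layout of the FTP USER command.
user_layout = [Field("Command", "USER", mutable=False), Field("Payload", "anonymous")]
alphabet = build_symbols("USER", user_layout)
for name, text in alphabet.items():
    print(name, repr(text[:20]))
```

Each rule is applied to one field at a time, so the alphabet grows with the number of mutable fields times the number of rules. That growth is the overhead the mini-batch strategy is meant to contain.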

If this is right

  • State machine learning can now include anomalous non-conformant messages in its exploration.
  • Manual protocol expertise is no longer required for defining input alphabets.
  • Overhead from growing alphabets is mitigated through reuse of existing learned automata.
  • New semantic bugs in protocol stacks can be discovered and reported for patching.
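
The overhead point can be illustrated with a small Python sketch. It assumes that each mini-batch is the basic alphabet plus the mutated symbols of a single message type, which is how the paper's Figure 3 excerpt appears to lay it out. The reuse mechanism shown here is plain query caching, a simpler stand-in for the paper's reuse of learned automata, whose details this page does not give. `mini_batches`, `CachedOracle`, and `toy_sut` are hypothetical names.

```python
from itertools import product


def mini_batches(basic, extended_by_type):
    """Yield one extended alphabet per message type: basic symbols plus that type's mutants."""
    for message_type, mutants in extended_by_type.items():
        yield message_type, list(basic) + [m for m in mutants if m not in basic]


class CachedOracle:
    """Wraps a system-under-test callable and remembers input words already executed."""

    def __init__(self, run):
        self.run = run
        self.cache = {}
        self.executed = 0

    def query(self, word):
        key = tuple(word)
        if key not in self.cache:
            self.executed += 1
            self.cache[key] = self.run(key)
        return self.cache[key]


def toy_sut(word):
    # Hypothetical stand-in for a real protocol stack.
    return tuple("OK" if symbol in ("USER", "PASS") else "ERR" for symbol in word)


basic = ["USER", "PASS"]
extended = {"USER": ["USER_Payload_null"], "PASS": ["PASS_Payload_null"]}

oracle = CachedOracle(toy_sut)
asked = 0
for _, alphabet in mini_batches(basic, extended):
    # Ask every word of length 1 or 2 over this batch's alphabet.
    for length in (1, 2):
        for word in product(alphabet, repeat=length):
            oracle.query(word)
            asked += 1

print(asked, oracle.executed)
```

In this toy run the two batches ask 24 words in total but only 18 reach the system under test, because words built purely from basic symbols are answered from the cache in the second batch.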

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This method could extend to learning models for other stateful systems beyond network protocols.
  • Improvements in language model accuracy might further increase the completeness of the generated alphabets.
  • Integration with existing fuzzing tools could enhance bug detection rates in security testing workflows.

Load-bearing premise

Large language models can reliably parse protocol message layouts and produce effective candidate input symbols following structured mutation rules that cover valid and invalid spaces.

What would settle it

A test showing that the LLM-generated alphabets fail to reproduce any of the known security vulnerabilities in the evaluated protocol stacks would disprove the effectiveness claim.
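
One way to phrase such a test is as a reachability check on the learned model. The Python sketch below is only an illustration: the Mealy machines are toy models with invented states and outputs, not the state machine of any real implementation. The symbol names `EmptyCert` and `CV_invalid` are borrowed from the paper's figure excerpts; `find_trace` is a hypothetical helper.

```python
from collections import deque


def find_trace(model, initial, target):
    """Breadth-first search for the shortest input word that reaches the target state."""
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, trace = queue.popleft()
        if state == target:
            return trace
        for symbol, (next_state, _output) in model.get(state, {}).items():
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, trace + [symbol]))
    return None


# Toy Mealy machine: state -> {input symbol: (next state, output)}.
vulnerable = {
    "start": {"ClientHello": ("hello_done", "ServerHello")},
    "hello_done": {"EmptyCert": ("cert_seen", "NoResponse")},
    "cert_seen": {"CV_invalid": ("authenticated", "Finished")},
}

# Same toy model, except that the invalid CertificateVerify is rejected.
patched = {
    "start": {"ClientHello": ("hello_done", "ServerHello")},
    "hello_done": {"EmptyCert": ("cert_seen", "NoResponse")},
    "cert_seen": {"CV_invalid": ("closed", "Alert")},
}

print(find_trace(vulnerable, "start", "authenticated"))
print(find_trace(patched, "start", "authenticated"))
```

If the generated alphabet lacked `EmptyCert` or `CV_invalid`, no such transition could appear in the learned model and the search would return `None` even for a vulnerable target, which is the failure mode the settling test looks for.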

Figures

Figures reproduced from arXiv: 2606.23464 by JiongHan Wang, Wenchao Huang.

Figure 1: CVE-2022-25638, Vulnerable State Machine

Figure 2: Framework Overview
… expected, and then follow it with an invalid certificate verification message (containing an unknown signature algorithm and arbitrary payload), thereby bypassing server authentication. The core condition for triggering this vulnerability is to introduce an Empty Certificate message (EmptyCert) into the message flow together with a carefully crafted invalid CertificateVerify message (C…

Figure 4: Single-Symbol Mutation Priority
… alphabet, thereby forming a new alphabet. Then, iterate through all the remaining symbols in the alphabet that have not yet been mutated and repeat the process. Implementation details are provided in Algorithm 1 in the appendix. Taking the FTP protocol as an example, as shown in …

Figure 3: Mini-Batch Learning Strategy
Diagram labels: Basic Alphabet; Extended Input Symbol Set divided by message type; Set of Extended Alphabets.
Extended input symbol sets: {TYPE, TYPE_Payload_null, TYPE_Payload_eunm, …}, {USER, USER_Payload_null, USER_Payload_replace, …}, {PASS, PASS_Payload_null, PASS_Payload_replace, …}, …
Alphabets: 0 = {USER, PASS, PASV, LIST, QUIT}; 1 = {USER, PASS, PASV, LIST, QUIT, TYPE, TYPE_Payload_null, TYPE_Payload_eunm, …}; 2 = {USER, PASS, PASV, LIST, …

Figure 6: CVE-2021-3336, Vulnerable State Machine
… namely EmptyCert and CV_invalid. Both messages can be automatically generated through mutation under the proposed framework. Equipped with the designed Diverse-Symbol Mutation Priority strategy, our approach reliably detects the vulnerability within 12 hours across all five repeated experiments. From a probabilistic perspective, even adopting a purely random mutat…
Original abstract

As a prevalent analytical technique for stateful protocol implementations, state machine learning suffers from a core bottleneck stemming from handcrafted input alphabets. Manual alphabet definition inherently limits the completeness of input exploration, making it difficult to capture anomalous non-conformant messages and consequently missing latent semantic defects. In this paper, we target automatic input alphabet generation to break the above limitation for state machine learning. We adopt large language models to parse protocol message layouts and produce candidate input symbols following structured mutation rules, which automatically covers valid and invalid message spaces and eliminates reliance on manual protocol expertise. Considering the rising overhead brought by continuously growing alphabets, we introduce a mini-batch incremental learning strategy to reuse existing learned automata when incorporating new alphabet entries. Comprehensive experiments on practical protocol stacks indicate our approach can reproduce existing security vulnerabilities and identify novel semantic bugs. A subset of these newly discovered issues has been confirmed and patched by developers, proving the practicability and effectiveness of our proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents an automated framework for input alphabet construction in state machine learning of stateful protocol implementations. It uses large language models to parse protocol message layouts and generate candidate symbols via structured mutation rules that aim to cover both valid and invalid message spaces, thereby removing the need for manual protocol expertise. To manage the overhead of growing alphabets, a mini-batch incremental learning strategy reuses previously learned automata. Experiments on practical protocol stacks are reported to reproduce known security vulnerabilities and discover novel semantic bugs, with some confirmed and patched by developers.

Significance. If the LLM-based alphabet generation can be shown to produce sufficiently complete and correct input sets without hidden manual intervention, the work would address a key practical bottleneck in protocol state machine learning, enabling broader automated exploration of implementation behaviors and potentially improving detection of semantic defects in deployed systems. The incremental learning component is a pragmatic contribution to scalability. The reported bug findings, if reproducible and attributable to the automation, would strengthen the case for the method's effectiveness in real-world security analysis.

major comments (2)
  1. [Abstract] Abstract: The central claim that the approach 'automatically covers valid and invalid message spaces and eliminates reliance on manual protocol expertise' is load-bearing for the contribution, yet the manuscript provides no quantitative validation of the LLM outputs (e.g., coverage statistics for valid formats, comparison of generated vs. expert alphabets on the same stacks, or error rates for nonsensical/malformed symbols). Without these metrics, it is unclear whether the reported bug discoveries result from the automated method or from unstated curation of LLM results.
  2. [Experiments] Experiments section: The abstract states that 'comprehensive experiments... indicate our approach can reproduce existing security vulnerabilities and identify novel semantic bugs,' but reports no details on experimental controls, statistical significance of findings, or ablation studies isolating the effect of the LLM-generated alphabet versus the incremental learner. This undermines attribution of the results to the proposed automation.
minor comments (1)
  1. [Method] The description of the structured mutation rules would benefit from an explicit enumeration or pseudocode example to clarify how valid and invalid spaces are targeted.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on validation and experimental rigor. We address the major comments point by point below.

  1. Referee: [Abstract] Abstract: The central claim that the approach 'automatically covers valid and invalid message spaces and eliminates reliance on manual protocol expertise' is load-bearing for the contribution, yet the manuscript provides no quantitative validation of the LLM outputs (e.g., coverage statistics for valid formats, comparison of generated vs. expert alphabets on the same stacks, or error rates for nonsensical/malformed symbols). Without these metrics, it is unclear whether the reported bug discoveries result from the automated method or from unstated curation of LLM results.

    Authors: We agree that direct quantitative metrics on LLM output quality would strengthen the central claim. The manuscript grounds the claim in the use of structured mutation rules applied to LLM-parsed layouts, which are designed to systematically generate both valid and invalid symbols without manual protocol expertise. Bug discoveries serve as an indirect demonstration of effectiveness, with several issues confirmed and patched by developers. No unstated curation occurred. In revision we will add coverage statistics, a comparison against expert alphabets for at least one protocol, and error-rate reporting on generated symbols. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract states that 'comprehensive experiments... indicate our approach can reproduce existing security vulnerabilities and identify novel semantic bugs,' but reports no details on experimental controls, statistical significance of findings, or ablation studies isolating the effect of the LLM-generated alphabet versus the incremental learner. This undermines attribution of the results to the proposed automation.

    Authors: We acknowledge that the experiments section lacks explicit controls, statistical tests, and ablations separating the LLM alphabet component from the incremental learner. The reported results are produced by the combined framework on real protocol stacks. In the revised version we will expand the experiments section with a detailed description of the experimental setup, ablation studies (full method vs. manual alphabets and vs. non-incremental learning), and any applicable statistical measures to better attribute outcomes to the automation. revision: yes
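
The comparison against expert alphabets promised in the first response could start from two simple set metrics. The Python sketch below is one possible reading, not something the paper reports: `compare_alphabets`, the metric names, and both example alphabets are hypothetical.

```python
def compare_alphabets(generated: set[str], expert: set[str]) -> dict[str, float]:
    """Compare a generated alphabet with an expert-written one by symbol name."""
    shared = generated & expert
    return {
        # Share of expert symbols that the generated alphabet also contains.
        "expert_recall": len(shared) / len(expert) if expert else 1.0,
        # Share of generated symbols that no expert alphabet lists.
        "novel_fraction": len(generated - expert) / len(generated) if generated else 0.0,
    }


# Illustrative data only; neither set comes from the paper.
generated = {"USER", "PASS", "USER_Payload_null", "PASS_Payload_null"}
expert = {"USER", "PASS", "QUIT"}

metrics = compare_alphabets(generated, expert)
print(metrics)
```

Matching by name only works when both alphabets share a naming scheme. A real comparison would more likely match on the concrete bytes each symbol sends, and the error rate of malformed symbols would need a separate check against the protocol grammar.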

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or self-referential fits

The paper presents an engineering framework that uses LLMs to generate protocol input alphabets, followed by incremental learning experiments on real stacks. The provided abstract and description contain no equations, no fitted parameters renamed as predictions, no uniqueness theorems, and no self-citation chains that bear the central claim. Claims rest on experimental reproduction of bugs rather than any closed derivation that reduces to its own inputs by construction. This is the expected non-finding for an empirical contribution without mathematical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no visible free parameters, axioms, or invented entities; the central claim rests on the unstated effectiveness of LLM parsing and mutation rules.
