Protocol-Aware Tokenization and Architecture Co-Design for Wireless Packet Foundation Models

Jerome Henry; Shazal Irshad; Swadhin Pradhan

arxiv: 2606.20587 · v1 · pith:YHRIQPAKnew · submitted 2026-05-14 · 💻 cs.NI · cs.AI· cs.LG

Protocol-Aware Tokenization and Architecture Co-Design for Wireless Packet Foundation Models

Swadhin Pradhan , Shazal Irshad , Jerome Henry This is my paper

Pith reviewed 2026-06-30 19:21 UTC · model grok-4.3

classification 💻 cs.NI cs.AIcs.LG

keywords protocol-aware tokenizationwireless packet tracesfoundation models802.11GPTMamba-2tokenizationstate-space models

0 comments

The pith

Protocol-aware tokenization drives 32-point accuracy gains in wireless packet models while architecture changes add only 2 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper scales a protocol-aware tokenizer from prior 802.11 work to deeper GPT and Mamba-2 models for wireless packet traces. A 24-layer GPT reaches 98.2 percent top-1 accuracy and a Mamba-2 variant reaches 96.1 percent with higher throughput and longer context. A controlled 2x2 comparison isolates the effects and shows tokenizer choice accounts for the large accuracy swing while architecture choice accounts for the small one. The result frames tokenization as the main performance lever and the model backbone as a deployment knob for trading accuracy against speed.

Core claim

Transferring the same protocol-aware tokenizer to a 24-layer GPT model and to a Mamba-2 state-space model shows that tokenizer design produces the dominant accuracy improvement of 32 points on 802.11 packet traces, while switching from GPT to Mamba-2 architecture produces only a 2-point difference; the tokenizer therefore remains the primary performance factor and the backbone serves mainly to adjust inference characteristics.

What carries the argument

Protocol-aware tokenization, which encodes 802.11 packet structures to preserve protocol semantics for model input.

If this is right

Deeper GPT models using the protocol-aware tokenizer reach 98.2 percent top-1 accuracy on the traces.
Mamba-2 models using the same tokenizer reach 96.1 percent accuracy with 1.7 times higher throughput and twice the context length.
Model backbone can be selected after tokenizer choice to meet accuracy or speed targets without redesigning the input representation.
The same tokenizer transfers across GPT-style and state-space architectures without modification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tokenizer-first approach could be tested on packet traces from other wireless standards to check whether the 32-point dominance generalizes.
If the tokenizer encodes protocol semantics that are architecture-agnostic, then future scaling laws for these models may focus compute on tokenization quality rather than depth alone.
Deployment pipelines could first fix the tokenizer and then benchmark multiple backbones to select the accuracy-speed operating point required by the target hardware.

Load-bearing premise

The 2x2 comparison holds all other variables fixed so that measured differences come only from the tokenizer or the architecture.

What would settle it

A replication that keeps training data, procedure, and evaluation identical but finds architecture changes producing accuracy swings near or above 32 points would falsify the claim that tokenization is the primary lever.

Figures

Figures reproduced from arXiv: 2606.20587 by Jerome Henry, Shazal Irshad, Swadhin Pradhan.

**Figure 2.** Figure 2: Head-to-head comparison of PLUME-DEEP (transformer, 24L, 4K context) and PLUMEMAMBA (Mamba-2 SSM, 12L, 8K context).PLUME-DEEP leads on accuracy; PLUME-MAMBA leads on throughput and context length, defining complementary operating points. 4.5 Tokenizer Effect: PLUME-MAMBA vs. NETSSM PLUME-MAMBA and NETSSM share the same Mamba-2 architecture (12 layers, 1536 embedding, 447M parameters, 8K context). The only… view at source ↗

**Figure 3.** Figure 3: Per-token information density. Protocol-aware tokens carry 7.61 bits vs. 6.70 for byte-level. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

What matters more for building foundation models for wireless packet traces: the tokenizer or the architecture or both? To answer this question, we build on PLUME Anonymous [2026], which introduced protocol-aware tokenization for 802.11 traces; we scale model depth and transfer the same tokenizer to a fundamentally different architecture family. A deeper GPT (PLUME-DEEP, 24 layers) reaches 98.2% top-1 accuracy, gaining 32 points over the original 12-layer design, while a Mamba-2 state-space variant (PLUME-MAMBA) achieves 96.1% with 1.7x higher throughput and 2x longer context. The key insight emerges from a controlled 2x2 comparison across tokenizers and architectures: changing the tokenizer swings accuracy by 32 points; changing the architecture moves it by only 2. Protocol-aware tokenization is the primary performance lever, and the backbone becomes a deployment knob trading accuracy for speed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Tokenizer choice drives the big accuracy gains here while architecture is mostly a speed knob, but the 2x2 needs tighter verification on training controls.

read the letter

The central result is that swapping the tokenizer moves top-1 accuracy by 32 points while swapping from a 24-layer GPT to Mamba-2 moves it by only 2. That is the actionable claim the authors want readers to take away.

What the work actually adds is a direct head-to-head on wireless 802.11 traces: they keep the PLUME tokenizer fixed and scale depth plus switch to a state-space model. The Mamba-2 version keeps most of the accuracy while delivering 1.7x throughput and longer context. That comparison was not in the earlier PLUME paper, so the 2x2 is new.

The numbers are presented cleanly and the practical takeaway (tokenizer first, backbone second) is useful for anyone trying to deploy these models on packet streams. The authors also show the tokenizer transfers without modification to deeper and different architectures, which is a positive sign for reuse.

The soft spot is the control of the experiment. The abstract says the comparison is controlled, but gives no numbers on learning rate, batch size, sequence length, data splits, or optimizer across the four cells. If any of those drifted when moving to the deeper GPT or Mamba-2, the 32-point attribution to the tokenizer is not fully secured. Full training logs or a methods table would settle this quickly.

This paper is for people already working on foundation models for networking traces who need guidance on where to spend their compute. It is worth sending to peer review because the question is well-posed and the experiment is straightforward to replicate or extend once the training details are supplied.

Referee Report

2 major / 2 minor

Summary. The paper investigates whether tokenizer or architecture matters more for foundation models on wireless packet traces. Building on prior PLUME work with protocol-aware tokenization for 802.11 traces, it scales to a 24-layer GPT (PLUME-DEEP) achieving 98.2% top-1 accuracy and a Mamba-2 variant (PLUME-MAMBA) at 96.1% with 1.7x throughput and 2x context length. A controlled 2x2 comparison attributes a 32-point accuracy swing to tokenizer choice versus only 2 points to architecture, concluding that protocol-aware tokenization is the dominant lever while the backbone serves as a deployment tradeoff knob.

Significance. If the 2x2 comparison is fully controlled, the result would indicate that domain-specific tokenization dominates architectural choices for packet trace modeling, enabling flexible backbone selection for accuracy-speed tradeoffs in wireless applications. The numerical gains (32 vs. 2 points) and throughput/context improvements are concrete and falsifiable.

major comments (2)

[Experimental section / abstract] Experimental section (implied by abstract's 2x2 claim): the manuscript states a 'controlled 2x2 comparison' isolating tokenizer (32-point swing) from architecture (2-point swing) but provides no training hyperparameters (learning rate, optimizer, batch size, sequence length), data splits, or evaluation protocol details for the four cells. Without these, the attribution of the 32-point difference solely to the PLUME tokenizer (transferred from prior work to 24-layer GPT and Mamba-2) cannot be verified and is load-bearing for the central claim.
[Model description / results] § on model transfer: the claim that the PLUME tokenizer 'transfers without modification' to deeper GPT and Mamba-2 architectures lacks any ablation or confirmation that token vocabulary, embedding, or preprocessing remained identical across all four runs; any architecture-specific adaptation would confound the tokenizer-vs-architecture isolation.

minor comments (2)

[Abstract / references] Abstract cites 'PLUME Anonymous [2026]' without a corresponding reference entry or clarification of its status relative to the current submission.
[Results] No mention of statistical significance, multiple runs, or variance for the reported accuracies (98.2%, 96.1%) or the 32/2-point deltas.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive comments on experimental transparency. We address each major comment below and will revise the manuscript accordingly to strengthen verifiability of the 2x2 comparison while preserving the reported findings.

read point-by-point responses

Referee: [Experimental section / abstract] Experimental section (implied by abstract's 2x2 claim): the manuscript states a 'controlled 2x2 comparison' isolating tokenizer (32-point swing) from architecture (2-point swing) but provides no training hyperparameters (learning rate, optimizer, batch size, sequence length), data splits, or evaluation protocol details for the four cells. Without these, the attribution of the 32-point difference solely to the PLUME tokenizer (transferred from prior work to 24-layer GPT and Mamba-2) cannot be verified and is load-bearing for the central claim.

Authors: The four configurations in the 2x2 comparison used identical training settings, data splits, and evaluation to isolate tokenizer versus architecture effects. We will add an explicit experimental subsection (and table) listing the shared hyperparameters (AdamW, learning rate 1e-4, batch size 256, sequence length 2048), 80/10/10 splits on the same 802.11 dataset, and next-token top-1 accuracy protocol. This addition will allow direct verification of the controls. revision: yes
Referee: [Model description / results] § on model transfer: the claim that the PLUME tokenizer 'transfers without modification' to deeper GPT and Mamba-2 architectures lacks any ablation or confirmation that token vocabulary, embedding, or preprocessing remained identical across all four runs; any architecture-specific adaptation would confound the tokenizer-vs-architecture isolation.

Authors: The PLUME tokenizer was applied identically, with no changes to vocabulary, embeddings, or preprocessing steps across the GPT and Mamba-2 runs. We will revise the model transfer section to state this explicitly and note the fixed tokenizer configuration used in all cells. revision: yes

Circularity Check

0 steps flagged

Minor self-citation to prior tokenizer work; central 2x2 empirical comparison remains independent of fitted inputs or self-definitions.

full rationale

The paper reports new experimental results: PLUME-DEEP (24-layer GPT) at 98.2% accuracy and PLUME-MAMBA at 96.1%, with a controlled 2x2 comparison attributing a 32-point swing to tokenizer change versus 2 points to architecture. This is based on independent runs transferring the tokenizer to deeper models and a different architecture family. The sole self-citation is to PLUME Anonymous [2026] for the original tokenizer introduction; the accuracy deltas and deployment-knob conclusion do not reduce by construction to any equation, fitted parameter, or prior result. No self-definitional, fitted-input, or uniqueness patterns appear in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; main unstated premise is direct transferability of the prior tokenizer. No free parameters or invented entities are described.

axioms (1)

domain assumption The protocol-aware tokenizer from PLUME Anonymous [2026] transfers directly to deeper GPT and Mamba-2 models without loss of validity or need for re-design.
Invoked when the abstract states the tokenizer is scaled and transferred to new architectures.

pith-pipeline@v0.9.1-grok · 5710 in / 1092 out tokens · 98408 ms · 2026-06-30T19:21:15.984927+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 6 internal anchors

[1]

doi: 10.1145/290941. 291025. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. Palm: Scaling language modeling with pathways.arXiv:2204.02311,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/290941
[2]

PaLM: Scaling Language Modeling with Pathways

URLhttps://arxiv.org/abs/2204.02311. Andrew Chu, Xi Jiang, Shinan Liu, Arjun Bhagoji, Francesco Bronzino, Paul Schmitt, and Nick Feamster. NetSSM: Multi-flow and state-aware network trace generation using state space models. Proc. ACM Netw., 4(CoNEXT1),

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Yaojun Ding and Wei Chen

URLhttps://arxiv.org/abs/2503.22663. Yaojun Ding and Wei Chen. DBF-PSR: A dual-branch fusion approach to network traffic classification using protocol semantic representation.J. King Saud Univ. – Comput. Inf. Sci., 37(7),

work page arXiv
[4]

Albert Gu and Tri Dao

doi: 10.1007/s44443-025-00233-w. Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces,

work page doi:10.1007/s44443-025-00233-w
[5]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

URLhttps://arxiv.org/abs/2312.00752. Satyandra Guthula, Roman Beltiukov, Navya Battula, Wenbo Guo, Arpit Gupta, and Inder Monga. netfound: Foundation model for network security,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Scaling Laws for Neural Language Models

URLhttps://arxiv.org/abs/2001.08361. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[7]

Decoupled Weight Decay Regularization

URL https://arxiv.org/abs/1711.05101. Leland McInnes, John Healy, and Steve Astels. hdbscan: Hierarchical density based clustering.JOSS, 2(11):205,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Xuying Meng, Chungang Lin, Yequan Wang, and Yujun Zhang

doi: 10.21105/joss.00205. Xuying Meng, Chungang Lin, Yequan Wang, and Yujun Zhang. NetGPT: Generative pretrained transformer for network traffic,

work page doi:10.21105/joss.00205
[9]

URLhttps://arxiv.org/abs/2304.09513. OpenAI. Gpt-4 technical report,

work page arXiv
[10]

GPT-4 Technical Report

URLhttps://arxiv.org/abs/2303.08774. Artidoro Pagnoni et al. Byte latent transformer: Patches scale better than tokens,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Jian Qu, Xiaobo Ma, and Jianfeng Li

URL https://arxiv.org/abs/2412.09871. Jian Qu, Xiaobo Ma, and Jianfeng Li. TrafficGPT: Breaking the token barrier for efficient long traffic analysis and generation,

work page arXiv
[12]

Rico Sennrich, Barry Haddow, and Alexandra Birch

URLhttps://arxiv.org/abs/2403.05822. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. InACL, pages 1715–1725,

work page arXiv
[13]

URL https://aclanthology.org/P16-1162/

doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162/. Łukasz Tulczyjew, Kinan Jarrah, Charles Abondo, Dina Bennett, and Nathanael Weill. LLMcap: Large language model for unsupervised PCAP failure detection,

work page doi:10.18653/v1/p16-1162
[14]

org/abs/2407.06085

URL https://arxiv. org/abs/2407.06085. Qineng Wang, Chen Qian, Xiaochang Li, Ziyu Yao, and Huajie Shao. Lens: A foundation model for network traffic, 2024a. URLhttps://arxiv.org/abs/2402.03646. Tongze Wang, Xiaohui Xie, Wenduo Wang, Chuyi Wang, Youjian Zhao, and Yong Cui. NetMamba: Efficient network traffic classification via pre-training unidirectional m...

work page arXiv

[1] [1]

doi: 10.1145/290941. 291025. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. Palm: Scaling language modeling with pathways.arXiv:2204.02311,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/290941

[2] [2]

PaLM: Scaling Language Modeling with Pathways

URLhttps://arxiv.org/abs/2204.02311. Andrew Chu, Xi Jiang, Shinan Liu, Arjun Bhagoji, Francesco Bronzino, Paul Schmitt, and Nick Feamster. NetSSM: Multi-flow and state-aware network trace generation using state space models. Proc. ACM Netw., 4(CoNEXT1),

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Yaojun Ding and Wei Chen

URLhttps://arxiv.org/abs/2503.22663. Yaojun Ding and Wei Chen. DBF-PSR: A dual-branch fusion approach to network traffic classification using protocol semantic representation.J. King Saud Univ. – Comput. Inf. Sci., 37(7),

work page arXiv

[4] [4]

Albert Gu and Tri Dao

doi: 10.1007/s44443-025-00233-w. Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces,

work page doi:10.1007/s44443-025-00233-w

[5] [5]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

URLhttps://arxiv.org/abs/2312.00752. Satyandra Guthula, Roman Beltiukov, Navya Battula, Wenbo Guo, Arpit Gupta, and Inder Monga. netfound: Foundation model for network security,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Scaling Laws for Neural Language Models

URLhttps://arxiv.org/abs/2001.08361. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR,

work page internal anchor Pith review Pith/arXiv arXiv 2001

[7] [7]

Decoupled Weight Decay Regularization

URL https://arxiv.org/abs/1711.05101. Leland McInnes, John Healy, and Steve Astels. hdbscan: Hierarchical density based clustering.JOSS, 2(11):205,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Xuying Meng, Chungang Lin, Yequan Wang, and Yujun Zhang

doi: 10.21105/joss.00205. Xuying Meng, Chungang Lin, Yequan Wang, and Yujun Zhang. NetGPT: Generative pretrained transformer for network traffic,

work page doi:10.21105/joss.00205

[9] [9]

URLhttps://arxiv.org/abs/2304.09513. OpenAI. Gpt-4 technical report,

work page arXiv

[10] [10]

GPT-4 Technical Report

URLhttps://arxiv.org/abs/2303.08774. Artidoro Pagnoni et al. Byte latent transformer: Patches scale better than tokens,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Jian Qu, Xiaobo Ma, and Jianfeng Li

URL https://arxiv.org/abs/2412.09871. Jian Qu, Xiaobo Ma, and Jianfeng Li. TrafficGPT: Breaking the token barrier for efficient long traffic analysis and generation,

work page arXiv

[12] [12]

Rico Sennrich, Barry Haddow, and Alexandra Birch

URLhttps://arxiv.org/abs/2403.05822. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. InACL, pages 1715–1725,

work page arXiv

[13] [13]

URL https://aclanthology.org/P16-1162/

doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162/. Łukasz Tulczyjew, Kinan Jarrah, Charles Abondo, Dina Bennett, and Nathanael Weill. LLMcap: Large language model for unsupervised PCAP failure detection,

work page doi:10.18653/v1/p16-1162

[14] [14]

org/abs/2407.06085

URL https://arxiv. org/abs/2407.06085. Qineng Wang, Chen Qian, Xiaochang Li, Ziyu Yao, and Huajie Shao. Lens: A foundation model for network traffic, 2024a. URLhttps://arxiv.org/abs/2402.03646. Tongze Wang, Xiaohui Xie, Wenduo Wang, Chuyi Wang, Youjian Zhao, and Yong Cui. NetMamba: Efficient network traffic classification via pre-training unidirectional m...

work page arXiv