
arxiv: 2310.17025 · v5 · submitted 2023-10-25 · 💻 cs.NI · cs.AI

netFound: Principled Design for Network Foundation Models

Pith reviewed 2026-05-24 06:42 UTC · model grok-4.3

classification 💻 cs.NI cs.AI
keywords network foundation models · traffic representation learning · protocol-aware tokenization · burst-flow hierarchical attention · privacy-by-construction · exogenous context discrimination · embedding anisotropy

The pith

netFound applies four principles from model failure diagnostics to produce network embeddings with 0.95 F1 on context discrimination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing network foundation models fail in three ways: they exploit dataset shortcuts rather than learning genuine traffic patterns, they produce collapsed embedding spaces, and they ignore exogenous network conditions. The paper translates these diagnosed problems into four concrete design principles: protocol-aware tokenization, operational context embedding, burst-flow hierarchical attention, and privacy-by-construction input design. netFound implements these principles, pretrains on a billion-token-scale corpus, and yields representations with lower anisotropy and stronger alignment to domain-expert features. This matters because reusable embeddings that reflect real traffic behavior could support many downstream analysis tasks without repeated full retraining. The design also excludes payload and IP addresses to preserve privacy by construction.

Core claim

netFound is a network foundation model whose architecture is motivated by diagnostic analysis of why prior models fail. By incorporating protocol-aware tokenization, operational context embedding, burst-flow hierarchical attention, and privacy-by-construction input design, and pretraining on a large-scale corpus of 4.2 billion flows, it produces high-quality representations with lower anisotropy, significantly higher alignment with domain-expert features, and an F1 of 0.95 on exogenous context discrimination where existing models score below 0.62. It outperforms baselines in both frozen-encoder and end-to-end fine-tuned evaluations while excluding payload and IP addresses.
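The frozen-encoder setting named in the claim can be pictured as a linear probe: embeddings are computed once, the encoder is never updated, and only a small classifier is fit on top. The sketch below is illustrative and is not the paper's evaluation protocol. The synthetic `embeddings`, the helper names `fit_linear_probe` and `probe_accuracy`, and the least-squares classifier are all assumptions made for the example.

```python
import numpy as np

def fit_linear_probe(embeddings, labels, num_classes):
    """Fit a least-squares linear classifier on fixed (frozen) embeddings."""
    x = np.hstack([embeddings, np.ones((len(embeddings), 1))])  # bias column
    targets = np.eye(num_classes)[labels]                       # one-hot targets
    weights, *_ = np.linalg.lstsq(x, targets, rcond=None)
    return weights

def probe_accuracy(weights, embeddings, labels):
    """Accuracy of the probe; the embeddings themselves are never modified."""
    x = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    predictions = (x @ weights).argmax(axis=1)
    return float((predictions == labels).mean())

# Synthetic stand-in for encoder outputs (assumption: two well-separated classes).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
centers = np.array([[-3.0] * 8, [3.0] * 8])
embeddings = centers[labels] + rng.normal(size=(200, 8))

train = np.arange(200) % 2 == 0  # alternate samples between train and test
weights = fit_linear_probe(embeddings[train], labels[train], num_classes=2)
accuracy = probe_accuracy(weights, embeddings[~train], labels[~train])
print(f"probe accuracy: {accuracy:.2f}")
```

If a probe this simple separates the classes, the structure must already be present in the embeddings, which is the argument the frozen-encoder results are meant to support.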

What carries the argument

Four design principles—protocol-aware tokenization, operational context embedding, burst-flow hierarchical attention, and privacy-by-construction input design—that translate identified failure modes into architectural choices for the netFound model.
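To make the burst-flow hierarchy more concrete, here is a minimal sketch of two-level encoding: packets are grouped into bursts, attention runs within each burst, and a second attention pass runs across the burst summaries. Everything in it is an assumption for illustration: the gap-based burst rule in `split_into_bursts`, the identity projections in `self_attention`, and the mean pooling are stand-ins, not netFound's actual layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def split_into_bursts(timestamps, gap=1.0):
    """Group packet indices into bursts; a new burst starts when the
    inter-arrival time exceeds `gap` (hypothetical burst definition)."""
    bursts, current = [], [0]
    for i in range(1, len(timestamps)):
        if timestamps[i] - timestamps[i - 1] > gap:
            bursts.append(current)
            current = []
        current.append(i)
    bursts.append(current)
    return bursts

def encode_flow(packet_vecs, timestamps, gap=1.0):
    """Attend within each burst, pool, then attend across burst summaries."""
    bursts = split_into_bursts(timestamps, gap)
    burst_vecs = np.stack(
        [self_attention(packet_vecs[idx]).mean(axis=0) for idx in bursts]
    )
    flow_vec = self_attention(burst_vecs).mean(axis=0)
    return bursts, flow_vec

packets = np.random.default_rng(0).normal(size=(5, 4))
bursts, flow_vec = encode_flow(packets, [0.0, 0.1, 0.2, 5.0, 5.1])
print(bursts, flow_vec.shape)
```

The point of the two levels is cost and structure: attention inside a burst stays short, while the second pass lets burst summaries interact across the whole flow.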

If this is right

  • Pretrained netFound embeddings carry useful structure that improves performance in frozen-encoder settings across benchmarks.
  • The model remains the top performer in all end-to-end fine-tuned evaluations.
  • Representations exhibit lower anisotropy and higher alignment with domain-expert features than prior models.
  • Privacy is preserved by design through exclusion of payload and IP addresses, and the model still outperforms the baselines under that restriction.
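For readers unfamiliar with the term, anisotropy is often estimated as the average cosine similarity between distinct embedding pairs: values near 1 indicate a collapsed space, values near 0 or below indicate vectors spread across directions. The sketch below uses that common definition; whether the paper uses exactly this estimator is an assumption, and the toy arrays are made up for illustration.

```python
import numpy as np

def anisotropy(embeddings):
    """Mean cosine similarity over all distinct pairs of embeddings."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit.T
    n = len(unit)
    off_diagonal_sum = sims.sum() - np.trace(sims)   # drop self-similarities
    return float(off_diagonal_sum / (n * (n - 1)))

# Toy examples: nearly parallel vectors versus vectors spread over the plane.
collapsed = np.array([[1.0, 0.0], [0.99, 0.01], [1.0, 0.02]])
spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

print(anisotropy(collapsed))  # close to 1: collapsed space
print(anisotropy(spread))     # negative: directions are well spread
```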

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar translation of diagnostic failure analysis into explicit design rules could be tested in other sequence or graph domains where shortcut exploitation occurs.
  • The released pipeline from raw PCAPs to inference and the 4.2-billion-flow pretraining dataset enable direct checks of whether the gains hold on traffic from different network operators.
  • Better modeling of exogenous conditions may improve robustness when models encounter previously unseen network environments or policy changes.

Load-bearing premise

The diagnostic findings from prior work correctly identify the root causes of failure in existing network foundation models, and the four proposed design principles directly mitigate those causes without introducing compensating weaknesses or evaluation artifacts.

What would settle it

Evaluation of netFound on a new dataset engineered to retain the same shortcuts as the pretraining data but alter the genuine traffic patterns, checking whether the F1 on context discrimination drops below 0.8 or anisotropy increases to prior-model levels.
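The check described above reduces to computing F1 on the re-engineered dataset and comparing it with the 0.8 cut-off. A minimal sketch follows, assuming binary labels and a single positive class; the paper's averaging scheme is not stated here, and the helper name `shortcut_reliance_suspected` is hypothetical.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def shortcut_reliance_suspected(y_true, y_pred, threshold=0.8):
    """True when F1 on the shortcut-preserving dataset falls below the cut-off."""
    return f1_score(y_true, y_pred) < threshold

# Toy labels (made up): one of two positives is missed.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(f1_score(y_true, y_pred), shortcut_reliance_suspected(y_true, y_pred))
```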

Figures

Figures reproduced from arXiv: 2310.17025 by Arpit Gupta, Haarika Manda, Inder Monga, Jaber Daneshamooz, Satyandra Guthula, Sylee Beltiukov, Walter Willinger, Wenbo Guo.

Figure 1: Comparison between a naive hierarchical model …
Figure 2: Data extraction, Featurization & Protocol-aware Tokenization: Pipeline for converting the packet traces into tokens with metadata. After the flows are extracted from packet traces, we collect the relevant fields into features at different granularities, following which we convert them into tokens.
Figure 3: Pre-training: the hierarchical transformer uses a …
Figure 4: The token prediction performance between net…
Figure 5: The testing performance of netFound and baselines …
Original abstract

Network foundation models promise reusable representations for diverse traffic analysis tasks, but recent diagnostic works have revealed fundamental problems: models exploit dataset shortcuts rather than learning genuine traffic patterns, produce collapsed embedding spaces, and fail to capture the exogenous network conditions that shape real-world behavior. We translate these diagnostic insights into four concrete design principles: protocol-aware tokenization, operational context embedding, burst-flow hierarchical attention, and privacy-by-construction input design, and build netFound, a network foundation model whose architecture is motivated by this failure analysis. We pretrain netFound on a billion-token-scale corpus over 5000 GPU hours, and demonstrate that it produces high-quality representations with lower anisotropy, significantly higher alignment with domain-expert features, and an F1 of 0.95 on exogenous context discrimination where existing state-of-the-art models score below 0.62, while preserving privacy by excluding payload and IP addresses. netFound demonstrates significant improvements in frozen-encoder evaluation, showing that pretrained embeddings themselves carry useful structure, and remains the top performer across all benchmarks in end-to-end fine-tuned settings. We release full open-source code, weights for three model sizes on HuggingFace, a containerized pipeline from raw PCAPs to downstream inference, and the full 4.2 billion flows pretraining dataset to facilitate reproducibility and further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that diagnostic insights into failures of existing network foundation models (shortcut exploitation, embedding collapse, failure to capture exogenous conditions) can be translated into four design principles (protocol-aware tokenization, operational context embedding, burst-flow hierarchical attention, and privacy-by-construction input design) to produce netFound. After pretraining on a 4.2B-flow corpus (billion-token scale, 5000 GPU hours), netFound yields lower-anisotropy embeddings with higher alignment to domain-expert features, achieves F1=0.95 on exogenous context discrimination (vs. <0.62 for prior SOTA), preserves privacy by excluding payload and IP addresses, and leads both frozen-encoder and fine-tuned benchmarks. Full code, weights for three model sizes, a containerized pipeline, and the pretraining dataset are released.

Significance. If the results hold under controlled evaluation, the work is significant for supplying the first architecture whose components are explicitly derived from published failure diagnostics rather than scale alone, for demonstrating usable structure in the frozen pretrained embeddings, and for the unusually complete reproducibility package (dataset, pipeline, weights). These elements would materially advance the design of reusable traffic representations.

major comments (2)
  1. [Evaluation / Experimental sections] The central claim—that the four design principles directly mitigate the diagnosed failure modes—rests on the reported performance deltas (F1 0.95 vs. <0.62, improved anisotropy and alignment). No ablation variants are described that remove or disable individual principles while holding model size, data volume, and training compute fixed; therefore the deltas cannot be attributed to the principles rather than corpus scale or training duration.
  2. [Abstract and §5 (results)] Abstract and results sections report quantitative improvements but supply no details on baseline implementations, statistical significance tests, train/validation/test splits, or potential evaluation confounds (e.g., distribution shift between pretraining and downstream tasks). These omissions prevent assessment of whether the numbers support the claim that netFound representations are genuinely higher-quality.
minor comments (2)
  1. [Model architecture section] Notation for the hierarchical attention mechanism and context-embedding layers should be defined more explicitly (e.g., with a small diagram or pseudocode) to allow readers to verify alignment with the stated design principles.
  2. [Input design / privacy section] The privacy-by-construction claim would be strengthened by an explicit statement of what information is discarded at tokenization time and confirmation that no IP or payload bytes reach the model.
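One way to make the statement requested in the second minor comment concrete is an allow-list applied before tokenization: only named header fields pass through, so IP addresses and payload bytes cannot reach the model even if new fields appear upstream. The sketch below is an assumption-laden illustration; the field names in `ALLOWED_FIELDS` and the dictionary packet format are hypothetical and are not taken from the netFound pipeline.

```python
# Hypothetical allow-list; the real pipeline's field set is not given here.
ALLOWED_FIELDS = ("protocol", "length", "tcp_flags", "direction", "inter_arrival")

def sanitize_packet(packet):
    """Return a copy that keeps only allow-listed header fields."""
    return {name: packet[name] for name in ALLOWED_FIELDS if name in packet}

raw_packet = {
    "src_ip": "192.0.2.10",      # documentation-range address, dropped
    "dst_ip": "198.51.100.7",    # documentation-range address, dropped
    "payload": b"\x16\x03\x01",  # dropped
    "protocol": "TCP",
    "length": 583,
    "tcp_flags": "PA",
    "direction": "out",
    "inter_arrival": 0.004,
}

clean_packet = sanitize_packet(raw_packet)
print(sorted(clean_packet))
```

An allow-list is the safer design compared with a deny-list, because an unanticipated sensitive field is excluded by default rather than leaked by default.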

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Evaluation / Experimental sections] The central claim—that the four design principles directly mitigate the diagnosed failure modes—rests on the reported performance deltas (F1 0.95 vs. <0.62, improved anisotropy and alignment). No ablation variants are described that remove or disable individual principles while holding model size, data volume, and training compute fixed; therefore the deltas cannot be attributed to the principles rather than corpus scale or training duration.

    Authors: We agree that the manuscript does not include ablation studies that isolate the contribution of each design principle under controlled conditions. Each principle is explicitly motivated by a specific failure mode from prior diagnostic literature, and netFound is shown to outperform prior models that lack these components. However, without ablations we cannot rigorously rule out contributions from scale or training factors. In the revised manuscript we will add ablation experiments that train variants with individual principles disabled while holding model size, data volume, and compute fixed. revision: yes

  2. Referee: [Abstract and §5 (results)] Abstract and results sections report quantitative improvements but supply no details on baseline implementations, statistical significance tests, train/validation/test splits, or potential evaluation confounds (e.g., distribution shift between pretraining and downstream tasks). These omissions prevent assessment of whether the numbers support the claim that netFound representations are genuinely higher-quality.

    Authors: The full manuscript provides an experimental setup description in §5, but we acknowledge that explicit details on baseline re-implementations, statistical significance testing, precise train/validation/test splits, and distribution-shift analysis are insufficient. In the revised version we will expand §5 and the appendix to include these elements, enabling direct assessment of result validity and potential confounds. revision: yes
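The ablation design promised in the first response can be sketched as a run grid: one full model plus one variant per principle, each with exactly one principle switched off and the budget fields left untouched. The `RunConfig` fields, the `"small"` size label, and the `training_steps` placeholder below are hypothetical values chosen for illustration, not settings from the paper.

```python
from dataclasses import dataclass, replace

PRINCIPLES = (
    "protocol_aware_tokenization",
    "operational_context_embedding",
    "burst_flow_hierarchical_attention",
    "privacy_by_construction_inputs",
)

@dataclass(frozen=True)
class RunConfig:
    name: str
    protocol_aware_tokenization: bool = True
    operational_context_embedding: bool = True
    burst_flow_hierarchical_attention: bool = True
    privacy_by_construction_inputs: bool = True
    # Budget fields held fixed across every run (placeholder values).
    model_size: str = "small"
    training_steps: int = 10_000

def ablation_grid(base):
    """Return the full model plus one variant per disabled principle."""
    runs = [base]
    for principle in PRINCIPLES:
        runs.append(replace(base, name=f"no_{principle}", **{principle: False}))
    return runs

for run in ablation_grid(RunConfig(name="full")):
    print(run.name, run.model_size, run.training_steps)
```

Because `replace` copies every unspecified field, model size and training budget cannot drift between variants, which is the control the referee asks for.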

Circularity Check

0 steps flagged

No circularity; design principles are motivated by external citations and results are empirical benchmarks

Full rationale

The manuscript contains no equations, derivations, or fitted parameters presented as predictions. It cites prior diagnostic works to motivate four design principles, then reports empirical performance on frozen-encoder and fine-tuned tasks against external state-of-the-art models. No self-citation is load-bearing for a mathematical claim, and no step reduces by construction to its own inputs. The central claims rest on observable benchmark deltas rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that the four design principles derived from prior diagnostics are sufficient to produce genuinely better representations; no free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)
  • domain assumption: Diagnostic insights from prior work on shortcut exploitation, collapsed embeddings, and missing exogenous context correctly identify the root causes that must be fixed.
    The paper states that it translates these diagnostic insights into the four design principles.

pith-pipeline@v0.9.0 · 5793 in / 1251 out tokens · 22778 ms · 2026-05-24T06:42:30.923235+00:00


Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 3 internal anchors

  1. [1]

    On the effectiveness of machine and deep learning for cyber security,

    G. Apruzzese, M. Colajanni, L. Ferretti, A. Guido, and M. Marchetti, “On the effectiveness of machine and deep learning for cyber security,” 2018 10th International Conference on Cyber Conflict (CyCon), pp. 371–390, 2018. [Online]. Available: https://api. semanticscholar.org/CorpusID:49656174

  2. [2]

    A survey on machine learning techniques for cyber security in the last decade,

    K. Shaukat, S. Luo, V . Varadharajan, I. A. Hameed, and M. Xu, “A survey on machine learning techniques for cyber security in the last decade,” IEEE Access, vol. 8, pp. 222 310–222 354, 2020

  3. [3]

    Outside the closed world: On using machine learning for network intrusion detection,

    R. Sommer and V . Paxson, “Outside the closed world: On using machine learning for network intrusion detection,” in 2010 IEEE Symposium on Security and Privacy , 2010, pp. 305–316

  4. [4]

    Underspecification presents challenges for credibility in modern machine learning,

    A. D’Amour, K. Heller, D. Moldovan, B. Adlam, B. Alipanahi, A. Beutel, C. Chen et al. , “Underspecification presents challenges for credibility in modern machine learning,” Journal of Machine Learning Research, 2022. [Online]. Available: http://jmlr.org/papers/ v23/20-1335.html

  5. [5]

    In search of netunicorn: A data-collection platform to develop generalizable ml models for network security problems,

    R. Beltiukov, W. Guo, A. Gupta, and W. Willinger, “In search of netunicorn: A data-collection platform to develop generalizable ml models for network security problems,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2023

  6. [6]

    A look behind the curtain: Traffic classification in an increasingly encrypted web,

    I. Akbari, M. A. Salahuddin, L. Ven, N. Limam, R. Boutaba, B. Mathieu, S. Moteau, and S. Tuffin, “A look behind the curtain: Traffic classification in an increasingly encrypted web,” Proc. ACM Meas. Anal. Comput. Syst. , vol. 5, no. 1, feb 2021. [Online]. Available: https://doi.org/10.1145/3447382

  7. [7]

    Ac-dc: Adaptive ensemble classification for network traffic identifi- cation,

    X. Jiang, S. Liu, S. Naama, F. Bronzino, P. Schmitt, and N. Feamster, “Ac-dc: Adaptive ensemble classification for network traffic identifi- cation,” 2023

  8. [8]

    Fine-grained TLS services classification with reject option,

    J. Luxemburk and T. ˇCejka, “Fine-grained TLS services classification with reject option,” Computer Networks , vol. 220, 2022. [Online]. Available: http://arxiv.org/abs/2202.11984

  9. [9]

    Error prevalence in nids datasets: A case study on cic-ids-2017 and cse- cic-ids-2018,

    L. Liu, G. Engelen, T. Lynar, D. Essam, and W. Joosen, “Error prevalence in nids datasets: A case study on cic-ids-2017 and cse- cic-ids-2018,” in 2022 IEEE Conference on Communications and Network Security (CNS) , 2022, pp. 254–262

  10. [10]

    Ai/ml for network security: The emperor has no clothes,

    A. S. Jacobs, R. Beltiukov, W. Willinger, R. A. Ferreira, A. Gupta, and L. Z. Granville, “Ai/ml for network security: The emperor has no clothes,” in Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (CCS) , 2022

  11. [11]

    Dos and don’ts of machine learning in computer security,

    D. Arp, E. Quiring, F. Pendlebury, A. Warnecke, F. Pierazzi, C. Wressnegger, L. Cavallaro, and K. Rieck, “Dos and don’ts of machine learning in computer security,” in 31st USENIX Security Symposium (USENIX Security 22) . Boston, MA: USENIX Association, Aug. 2022, pp. 3971–3988. [Online]. Available: https://www.usenix.org/conference/usenixsecurity22/presen...

  12. [12]

    A comprehensive survey on pretrained foundation models: A history from bert to chatgpt,

    C. Zhou, Q. Li, C. Li, J. Yu, Y . Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, L. He, H. Peng, J. Li, J. Wu, Z. Liu, P. Xie, C. Xiong, J. Pei, P. S. Yu, and L. Sun, “A comprehensive survey on pretrained foundation models: A history from bert to chatgpt,” 2023

  13. [13]

    Gpt-4 technical report,

    OpenAI, “Gpt-4 technical report,” 2023

  14. [14]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre- training of deep bidirectional transformers for language understand- ing,” ArXiv, vol. abs/1810.04805, 2019

  15. [15]

    Vision transformer architecture search,

    X. Su, S. You, J. Xie, M. Zheng, F. Wang, C. Qian, C. Zhang, X. Wang, and C. Xu, “Vision transformer architecture search,” ArXiv, vol. abs/2106.13700, 2021

  16. [16]

    Pinot: Programmable infrastructure for networking,

    R. Beltiukov, S. Chandrasekaran, A. Gupta, and W. Willinger, “Pinot: Programmable infrastructure for networking,” in Proceedings of the Applied Networking Research Workshop , ser. ANRW ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 51–53. [Online]. Available: https://doi.org/10.1145/3606464.3606485

  17. [17]

    Experience-driven research on programmable networks,

    H. Kim, X. Chen, J. Brassil, and J. Rexford, “Experience-driven research on programmable networks,” SIGCOMM Comput. Commun. Rev., vol. 51, no. 1, p. 10–17, mar 2021. [Online]. Available: https://doi.org/10.1145/3457175.3457178

  18. [18]

    Et-bert: A contextualized datagram representation with pre-training transformers for encrypted traffic classification,

    X. Lin, G. Xiong, G. Gou, Z. Li, J. Shi, and J. Yu, “Et-bert: A contextualized datagram representation with pre-training transformers for encrypted traffic classification,” in Proceedings of the ACM Web Conference 2022 , ser. WWW ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 633–642. [Online]. Available: https://doi.org/10.1145/34...

  19. [19]

    Yet another traffic classifier: A masked autoencoder based traffic transformer with multi-level flow representation,

    R. Zhao, M. Zhan, X. Deng, Y . Wang, Y . Wang, G. Gui, and Z. Xue, “Yet another traffic classifier: A masked autoencoder based traffic transformer with multi-level flow representation,” Proceedings of the AAAI Conference on Artificial Intelligence , vol. 37, no. 4, pp. 5420–5427, Jun. 2023. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/...

  20. [20]

    Multi- classification approaches for classifying mobile app traffic,

    G. Aceto, D. Ciuonzo, A. Montieri, and A. Pescapé, “Multi- classification approaches for classifying mobile app traffic,” J. Netw. Comput. Appl., vol. 103, pp. 131–145, 2018

  21. [21]

    FlowPrint: Semi-Supervised Mobile-App Fingerprinting on Encrypted Network Traffic,

    T. van Ede, R. Bortolameotti, A. Continella, J. Ren, D. J. Dubois, M. Lindorfer, D. Choffness, M. van Steen, and A. Peter, “FlowPrint: Semi-Supervised Mobile-App Fingerprinting on Encrypted Network Traffic,” in NDSS. The Internet Society, 2020

  22. [22]

    A survey on encrypted network traffic analysis applications, techniques, and countermeasures,

    E. Papadogiannaki and S. Ioannidis, “A survey on encrypted network traffic analysis applications, techniques, and countermeasures,” ACM Comput. Surv. , vol. 54, no. 6, jul 2021. [Online]. Available: https://doi.org/10.1145/3457904

  23. [23]

    End-to-end en- crypted traffic classification with one-dimensional convolution neural networks,

    W. Wang, M. Zhu, J. Wang, X. Zeng, and Z. Yang, “End-to-end en- crypted traffic classification with one-dimensional convolution neural networks,” 2017 IEEE International Conference on Intelligence and Security Informatics (ISI) , pp. 43–48, 2017

  24. [24]

    Characterization of tor traffic using time based features,

    A. H. Lashkari, G. Draper-Gil, M. S. I. Mamun, and A. A. Ghorbani, “Characterization of tor traffic using time based features,” in Inter- national Conference on Information Systems Security and Privacy , 2017

  25. [25]

    Flag: Flow representation generator based on self-supervised learning for encrypted traffic classification,

    W. Wei, T. Ju, H. Liao, W. Zhao, and H. Gu, “Flag: Flow representation generator based on self-supervised learning for encrypted traffic classification,” in 5th Asia-Pacific Workshop on Networking (APNet 2021) , ser. APNet 2021. New York, NY , USA: Association for Computing Machinery, 2022, p. 14–20. [Online]. Available: https://doi.org/10.1145/3469393.3469394 14

  26. [26]

    A comparative study of network traffic representations for novelty detection,

    K. Yang, S. Kpotufe, and N. Feamster, “A comparative study of network traffic representations for novelty detection,” CoRR, vol. abs/2006.16993, 2020. [Online]. Available: https://arxiv.org/abs/ 2006.16993

  27. [27]

    New directions in automated traffic analysis,

    J. Holland, P. Schmitt, N. Feamster, and P. Mittal, “New directions in automated traffic analysis,” Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (CCS) , 2020

  28. [28]

    Deep-full-range: A deep learning based network encrypted traffic classification and intrusion detection framework,

    Y . Zeng, H. Gu, W. Wenting, and Y . Guo, “Deep-full-range: A deep learning based network encrypted traffic classification and intrusion detection framework,” IEEE Access, vol. PP, pp. 1–1, 01 2019

  29. [29]

    Kitsune: An ensemble of autoencoders for online network intrusion detection,

    Y . Mirsky, T. Doitshman, Y . Elovici, and A. Shabtai, “Kitsune: An ensemble of autoencoders for online network intrusion detection,” in 25th Annual Network and Distributed System Security Symposium, NDSS. The Internet Society, 2018

  30. [30]

    Machine learning for botnet detection: An optimized feature selection approach,

    M. Lefoane, I. Ghafir, S. Kabir, and I.-U. Awan, “Machine learning for botnet detection: An optimized feature selection approach,” in The 5th International Conference on Future Networks & Distributed Systems, ser. ICFNDS 2021. New York, NY , USA: Association for Computing Machinery, 2022, p. 195–200. [Online]. Available: https://doi.org/10.1145/3508072.3508102

  31. [31]

    A survey on data-driven software vulnerability assessment and prioritization,

    T. H. M. Le, H. Chen, and M. A. Babar, “A survey on data-driven software vulnerability assessment and prioritization,” ACM Comput. Surv. , vol. 55, no. 5, dec 2022. [Online]. Available: https://doi.org/10.1145/3529757

  32. [32]

    Network traffic classifier with convolutional and recurrent neural networks for internet of things,

    M. Lopez-Martin, B. Carro, A. Sanchez-Esguevillas, and J. Lloret, “Network traffic classifier with convolutional and recurrent neural networks for internet of things,” IEEE access , vol. 5, pp. 18 042– 18 050, 2017

  33. [33]

    Flowpic: Encrypted internet traffic classifi- cation is as easy as image recognition,

    T. Shapira and Y . Shavitt, “Flowpic: Encrypted internet traffic classifi- cation is as easy as image recognition,”IEEE INFOCOM 2019 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 680–687, 2019

  34. [34]

    A neural attention model for real-time network intrusion detection,

    M. Tan, A. Iacovazzi, N.-M. M. Cheung, and Y . Elovici, “A neural attention model for real-time network intrusion detection,” in 2019 IEEE 44th conference on local computer networks (LCN) . IEEE, 2019, pp. 291–299

  35. [35]

    Deep packet: A novel approach for encrypted traffic classification using deep learning,

    M. Lotfollahi, M. Jafari Siavoshani, R. Shirali Hossein Zade, and M. Saberian, “Deep packet: A novel approach for encrypted traffic classification using deep learning,” Soft Computing , vol. 24, no. 3, pp. 1999–2012, 2020

  36. [36]

    Large-scale mobile app iden- tification using deep learning,

    S. Rezaei, B. Kroencke, and X. Liu, “Large-scale mobile app iden- tification using deep learning,” IEEE Access , vol. 8, pp. 348–362, 2019

  37. [37]

    Encrypted network traffic classification using deep and parallel network-in- network models,

    Z. Bu, B. Zhou, P. Cheng, K. Zhang, and Z.-H. Ling, “Encrypted network traffic classification using deep and parallel network-in- network models,” Ieee Access, vol. 8, pp. 132 950–132 959, 2020

  38. [38]

    Byte segment neural network for network traffic classification,

    R. Li, X. Xiao, S. Ni, H. Zheng, and S. Xia, “Byte segment neural network for network traffic classification,” in 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS) . IEEE, 2018, pp. 1–10

  39. [39]

    Mt-flowformer: A semi-supervised flow transformer for encrypted traffic classification,

    R. Zhao, X. Deng, Z. Yan, J. Ma, Z. Xue, and Y . Wang, “Mt-flowformer: A semi-supervised flow transformer for encrypted traffic classification,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , ser. KDD ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 2576–2584. [Online]. Available: https://do...

  40. [40]

    Inferring streaming video quality from encrypted traffic: Practical models and deployment experience,

    F. Bronzino, P. Schmitt, S. Ayoubi, G. Martins, R. Teixeira, and N. Feamster, “Inferring streaming video quality from encrypted traffic: Practical models and deployment experience,” Proc. ACM Meas. Anal. Comput. Syst. , vol. 3, no. 3, dec 2019. [Online]. Available: https://doi.org/10.1145/3366704

  41. [41]

    Privateeye: Scalable and privacy-preserving compro- mise detection in the cloud,

    B. Arzani, S. Ciraci, S. Saroiu, A. Wolman, J. W. Stokes, G. Outhred, and L. Diwu, “Privateeye: Scalable and privacy-preserving compro- mise detection in the cloud,” in Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation , ser. NSDI’20. USA: USENIX Association, 2020, p. 797–816

  42. [42]

    Pert: Payload encoding representation from transformer for encrypted traffic classification,

    H. Y . He, Z. Guo Yang, and X. N. Chen, “Pert: Payload encoding representation from transformer for encrypted traffic classification,” in 2020 ITU Kaleidoscope: Industry-Driven Digital Transformation (ITU K), 2020, pp. 1–8

  43. [43]

    Flow-mae: Leveraging masked autoencoder for accurate, efficient and robust malicious traffic classification,

    Z. Hang, Y . Lu, Y . Wang, and Y . Xie, “Flow-mae: Leveraging masked autoencoder for accurate, efficient and robust malicious traffic classification,” in Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses , ser. RAID ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 297–314. [Online]. Avail...

  44. [44]

    Mtsecurity: Privacy-preserving malicious traffic classification using graph neural network and transformer,

    J. Yang, X. Jiang, Y . Lei, W. Liang, Z. Ma, and S. Li, “Mtsecurity: Privacy-preserving malicious traffic classification using graph neural network and transformer,”IEEE Transactions on Network and Service Management, pp. 1–1, 2024

  45. [45]

    Trafficgpt: Breaking the token barrier for efficient long traffic analysis and genera- tion,

    J. Qu, X. Ma, and J. Li, “Trafficgpt: Breaking the token barrier for efficient long traffic analysis and genera- tion,” ArXiv, vol. abs/2403.05822, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:268351552

  46. [46]

    Lens: A foundation model for network traffic in cyberse- curity,

    Q. Wang, C. Qian, X. Li, Z. Yao, and H. Shao, “Lens: A foundation model for network traffic in cyberse- curity,” ArXiv, vol. abs/2402.03646, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:267628222

  47. [47]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021

  48. [48]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y . Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” 2021

  49. [49]

    Auto-encoding variational bayes,

    D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” 2022

  50. [50]

    Generative adversarial networks,

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial networks,” 2014

  51. [51]

    Long-short transformer: Efficient transformers for language and vision,

    C. Zhu, W. Ping, C. Xiao, M. Shoeybi, T. Goldstein, A. Anandkumar, and B. Catanzaro, “Long-short transformer: Efficient transformers for language and vision,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 17 723– 17 736. [Online]...

  52. [52]

    Hier- archical attention networks for document classification,

    Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hier- archical attention networks for document classification,” in NAACL, 2016

  53. [53]

    Deepvsa: Facili- tating value-set analysis with deep learning for postmortem program analysis

    W. Guo, D. Mu, X. Xing, M. Du, and D. Song, “Deepvsa: Facili- tating value-set analysis with deep learning for postmortem program analysis.” in USENIX Security Symposium , 2019

  54. [54]

    Hierarchical transformers are more efficient language models,

    P. Nawrot, S. Tworkowski, M. Tyrolski, L. Kaiser, Y . Wu, C. Szegedy, and H. Michalewski, “Hierarchical transformers are more efficient language models,” in Findings of the Association for Computational Linguistics: NAACL 2022 , M. Carpuat, M.-C. de Marneffe, and I. V . Meza Ruiz, Eds. Seattle, United States: Association for Computational Linguistics, Jul...

  55. [55]

    Word embeddings: A survey,

    F. Almeida and G. Xexéo, “Word embeddings: A survey,” 2023

  56. [56]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017

  57. [57]

    Should you mask 15% in masked language modeling?

    A. Wettig, T. Gao, Z. Zhong, and D. Chen, “Should you mask 15% in masked language modeling?” in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, A. Vlachos and I. Augenstein, Eds. Dubrovnik, Croatia: Association for Computational Linguistics, May 2023, pp. 2985–3000. [Online]. Available: https:/...

  58. [58]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

  59. [59]

    Flowprint: Semi-supervised mobile-app fingerprinting on encrypted network traffic,

    T. van Ede, R. Bortolameotti, A. Continella, J. Ren, D. J. Dubois, M. Lindorfer, D. R. Choffnes, M. van Steen, and A. Peter, “Flowprint: Semi-supervised mobile-app fingerprinting on encrypted network traffic,” Proceedings 2020 Network and Distributed System Security Symposium, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:211265114

  60. [60]

    Characterization of encrypted and vpn traffic using time-related features,

    G. Draper-Gil, A. H. Lashkari, M. S. I. Mamun, and A. A. Ghorbani, “Characterization of encrypted and vpn traffic using time-related features,” in International Conference on Information Systems Security and Privacy, 2016

  61. [61]

    Toward generating a new intrusion detection dataset and intrusion traffic characterization,

    I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward generating a new intrusion detection dataset and intrusion traffic characterization,” in Proceedings of the 4th International Conference on Information Systems Security and Privacy - Volume 1: ICISSP, INSTICC. SciTePress, 2018, pp. 108–116

  62. [62]

    Patator,

    “Patator,” https://github.com/lanjelot/patator

  63. [63]

    Random decision forests,

    T. K. Ho, “Random decision forests,” in Proceedings of 3rd international conference on document analysis and recognition, vol. 1. IEEE, 1995, pp. 278–282

  64. [64]

    Support-vector networks,

    C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995

  65. [65]

    The probable error of a mean,

    Student, “The probable error of a mean,” Biometrika, pp. 1–25, 1908

  66. [66]

    From grim reality to practical solution: Malware classification in real-world noise,

    X. Wu, W. Guo, J. Yan, B. Coskun, and X. Xing, “From grim reality to practical solution: Malware classification in real-world noise,” in 2023 IEEE Symposium on Security and Privacy (SP), 2023, pp. 2602–2619

  67. [67]

    Longformer: The long-document transformer,

    I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” 2020

  68. [68]

    A survey on advanced persistent threats: Techniques, solutions, challenges, and research opportunities,

    A. Alshamrani, S. Myneni, A. Chowdhary, and D. Huang, “A survey on advanced persistent threats: Techniques, solutions, challenges, and research opportunities,” IEEE Communications Surveys & Tutorials, vol. 21, no. 2, pp. 1851–1877, 2019

  69. [69]

    Deep metric learning: A survey,

    M. Kaya and H. Ş. Bilge, “Deep metric learning: A survey,” Symmetry, vol. 11, no. 9, p. 1066, 2019

  70. [70]

    LoRA: Low-Rank Adaptation of Large Language Models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021

  71. [71]

    A new hope for network model generalization,

    A. Dietmüller, S. Ray, R. Jacob, and L. Vanbever, “A new hope for network model generalization,” in Proceedings of the 21st ACM Workshop on Hot Topics in Networks, ser. HotNets ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 152–159. [Online]. Available: https://doi.org/10.1145/3563766.3564104

  72. [72]

    Towards transferable adversarial attacks on vision transformers,

    Z. Wei, J. Chen, M. Goldblum, Z. Wu, T. Goldstein, and Y.-G. Jiang, “Towards transferable adversarial attacks on vision transformers,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, 2022, pp. 2668–2676

  73. [73]

    Adversarial attack and defense technologies in natural language processing: A survey,

    S. Qiu, Q. Liu, S. Zhou, and W. Huang, “Adversarial attack and defense technologies in natural language processing: A survey,” Neurocomputing, vol. 492, pp. 278–307, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231222003861

Appendix

In this appendix, we provide brief information about traffic distribution between di...