Recognition: 3 theorem links
Evaluating Tabular Representation Learning for Network Intrusion Detection
Pith reviewed 2026-05-08 18:58 UTC · model grok-4.3
The pith
Tabular representation learning techniques learn useful features from NetFlow data for intrusion detection, but no single method dominates and supervised approaches outperform unsupervised ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tabular representation learning methods automatically extract meaningful representations from NetFlow data that support intrusion detection. For supervised classification, TabICL delivers the highest performance on the CIDDS dataset, while autoencoders and end-to-end transformer models achieve the best average rank across datasets. Supervised approaches using these representations substantially outperform unsupervised anomaly detection methods, where optimal choices again vary by dataset. Cross-dataset transfer experiments confirm that the learned representations can generalize across different network environments when appropriate method and classifier pairs are selected, although transfer performance varies substantially with the specific source-target dataset combination.
What carries the argument
Tabular representation learning techniques applied to benchmark NetFlow datasets, producing feature representations that feed both supervised classifiers and unsupervised anomaly detectors.
If this is right
- No single representation learning method or classifier combination performs best on every NetFlow dataset.
- Supervised classification with learned representations consistently beats unsupervised anomaly detection across the tested scenarios.
- Representations learned on one network can transfer to another when method and classifier selection accounts for distributional differences.
- Comprehensive hyperparameter tuning for each method-classifier-dataset triple is required to reach competitive performance.
- Transfer success varies substantially with the specific source-target dataset pair.
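The cross-dataset transfer setting in the bullets above can be sketched end to end with a deliberately tiny stand-in: a per-feature standardizer fitted on the source network plays the role of the learned representation, and a nearest-centroid rule plays the role of the classifier. All flows, labels, and numbers below are hypothetical and far simpler than the methods the paper evaluates.

```python
from statistics import mean, stdev

def fit_standardizer(rows):
    """Learn per-feature mean/std on the source data (stand-in for representation learning)."""
    cols = list(zip(*rows))
    return [(mean(c), stdev(c) or 1.0) for c in cols]

def transform(rows, stats):
    return [[(v - m) / s for v, (m, s) in zip(r, stats)] for r in rows]

def fit_centroids(rows, labels):
    """One centroid per class in the learned representation space."""
    out = {}
    for lbl in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lbl]
        out[lbl] = [mean(c) for c in zip(*members)]
    return out

def predict(row, centroids):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: dist(row, centroids[lbl]))

# Hypothetical source-network flows: (duration, bytes) with benign/attack labels.
src_X = [[1.0, 100], [1.2, 120], [9.0, 5000], [8.5, 4800]]
src_y = ["benign", "benign", "attack", "attack"]
# Hypothetical target network with shifted scale but the same qualitative structure.
tgt_X = [[1.1, 130], [8.8, 5100]]
tgt_y = ["benign", "attack"]

stats = fit_standardizer(src_X)  # "representation" learned on the source only
centroids = fit_centroids(transform(src_X, stats), src_y)
preds = [predict(r, centroids) for r in transform(tgt_X, stats)]
accuracy = sum(p == t for p, t in zip(preds, tgt_y)) / len(tgt_y)
print(accuracy)
```

Transfer succeeds here only because the two toy networks share structure after standardization; under larger distributional shifts the same pipeline degrades, which is exactly the sensitivity the paper reports.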
Where Pith is reading between the lines
- Security teams may need to test several representation methods on their own traffic data instead of relying on a single recommended approach.
- The transfer results suggest that pre-training representations on large public NetFlow corpora could reduce the data requirements for new deployments.
- Strong dataset dependency implies that future benchmarks should include more varied traffic conditions, such as encrypted flows or adversarial traffic, to test robustness.
- The gap between supervised and unsupervised performance points to opportunities for hybrid methods that leverage limited labels to improve anomaly detection.
Load-bearing premise
The chosen benchmark NetFlow datasets together with the hyperparameter search ranges capture enough of the variability present in real network environments for the performance and transfer conclusions to generalize.
What would settle it
Apply the identical set of representation learning methods, classifiers, and hyperparameter ranges to a fresh NetFlow dataset collected from an entirely different organizational network and check whether the reported performance rankings, average ranks, and cross-dataset transfer patterns still hold.
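The average-rank comparison such a replication would check can be sketched as follows; the method names and scores are hypothetical placeholders, not values reported by the paper.

```python
from statistics import mean

# Hypothetical F1 scores per dataset for three methods (illustrative only).
scores = {
    "CIDDS":     {"tabicl": 0.97, "autoencoder": 0.95, "transformer": 0.95},
    "UNSW-NB15": {"tabicl": 0.88, "autoencoder": 0.91, "transformer": 0.90},
}

def average_ranks(scores):
    """Rank methods per dataset (1 = best, ties share the mean rank), then average."""
    ranks = {m: [] for m in next(iter(scores.values()))}
    for per_method in scores.values():
        ordered = sorted(per_method.values(), reverse=True)
        for method, s in per_method.items():
            # The mean of all positions occupied by this score handles ties.
            positions = [i + 1 for i, v in enumerate(ordered) if v == s]
            ranks[method].append(mean(positions))
    return {m: mean(r) for m, r in ranks.items()}

print(average_ranks(scores))
```

A "tie for best average rank", as reported in the abstract, corresponds to two methods ending up with equal averaged ranks under this scheme.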
Original abstract
Classic Network Intrusion Detection Systems (NIDS) often rely on manual feature engineering to extract meaningful patterns from network traffic data. However, this approach requires domain expertise and runs counter to the widely adopted principle of modern machine learning and neural networks: that models themselves should learn meaningful representations directly from data. We investigate whether tabular representation learning techniques can improve intrusion detection performance by automatically learning robust feature representations for NetFlow data. This paper presents a systematic evaluation of state-of-the-art representation learning methods on benchmark NetFlow datasets, comparing against traditional autoencoders and end-to-end transformer baselines. We evaluate learned representations using both supervised classifiers and unsupervised anomaly detectors, with comprehensive hyperparameter exploration for each combination. Our results reveal strong dataset-model dependency, with no single approach consistently dominating across all scenarios. For supervised classification, TabICL achieves the best performance on CIDDS, while autoencoders follow closely and tie with end-to-end transformer models for the best average rank across datasets. Supervised approaches substantially outperform unsupervised anomaly detection methods, where no single combination consistently dominates as optimal choices depend on the dataset. Cross-dataset transfer experiments demonstrate that learned representations can generalize across network environments with appropriate method and classifier selection. However, transfer performance varies substantially depending on the source-target dataset combination, indicating sensitivity to distributional differences between network environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a systematic empirical evaluation of tabular representation learning methods (TabICL, autoencoders, end-to-end transformers) for network intrusion detection using benchmark NetFlow datasets. It compares performance in supervised classification and unsupervised anomaly detection with comprehensive hyperparameter exploration, reports strong dataset-model dependencies with no universally dominant approach, and includes cross-dataset transfer experiments demonstrating variable generalization of learned representations across network environments.
Significance. If the empirical results hold, the work provides a valuable benchmark for representation learning on tabular NetFlow data, underscoring the need for dataset-specific method selection and the feasibility of transfer learning with caveats. The comprehensive hyperparameter exploration and dual supervised/unsupervised evaluation modes are strengths that could guide practitioners away from manual feature engineering in NIDS.
major comments (2)
- [Cross-dataset transfer experiments] Cross-dataset transfer experiments (Abstract and results): the central claim that 'learned representations can generalize across network environments with appropriate method and classifier selection' rests on transfer results among a small set of academic NetFlow benchmarks (e.g., CIDDS and similar) that share comparable feature schemas, traffic collection methods, and attack taxonomies. No analysis of dataset similarity (e.g., MMD or feature-distribution distances) is provided to bound the magnitude of observed shifts or support extrapolation to larger real-world domain gaps.
- [Results] Results section: the support for the 'strong dataset-model dependency' claim and average-rank comparisons is moderate because the manuscript provides no visible statistical tests, exact hyperparameter configurations, or full result tables, making it difficult to assess the robustness and reproducibility of the reported performance differences and ties.
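The dataset-similarity analysis requested in the first comment could, for instance, use a Gaussian-kernel maximum mean discrepancy (MMD). A minimal sketch, with hypothetical standardized feature vectors standing in for NetFlow features:

```python
import math

def mmd2(xs, ys, gamma=0.5):
    """Biased squared-MMD estimate with Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(a, b):
        return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    def mean_kernel(us, vs):
        return sum(k(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)

# Hypothetical standardized flow features (duration, bytes, packets).
source = [(0.1, 0.2, 0.3), (0.2, 0.1, 0.4), (0.0, 0.3, 0.2)]
target_near = [(0.1, 0.25, 0.3), (0.15, 0.1, 0.35), (0.05, 0.3, 0.25)]
target_far = [(2.0, 1.5, 1.0), (1.8, 1.6, 1.2), (2.1, 1.4, 0.9)]

# A larger MMD indicates a larger distribution shift between source and target.
print(mmd2(source, target_near) < mmd2(source, target_far))
```

Reporting such pairwise values alongside transfer scores would let readers judge whether transfer degradation tracks measured distribution shift.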
minor comments (2)
- [Abstract] Abstract: the claim that 'autoencoders follow closely and tie with end-to-end transformer models for the best average rank across datasets' would be strengthened by explicit reference to the numerical rank values or the corresponding table.
- [Evaluation] Evaluation modes: clarify whether the unsupervised anomaly detection results use the same feature representations as the supervised classifiers or if additional preprocessing steps differ.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below and are prepared to revise the manuscript accordingly to strengthen the presentation and rigor of our empirical evaluation.
Point-by-point responses
Referee: [Cross-dataset transfer experiments] Cross-dataset transfer experiments (Abstract and results): the central claim that 'learned representations can generalize across network environments with appropriate method and classifier selection' rests on transfer results among a small set of academic NetFlow benchmarks (e.g., CIDDS and similar) that share comparable feature schemas, traffic collection methods, and attack taxonomies. No analysis of dataset similarity (e.g., MMD or feature-distribution distances) is provided to bound the magnitude of observed shifts or support extrapolation to larger real-world domain gaps.
Authors: We agree that the datasets are standard academic benchmarks sharing similar schemas and collection characteristics, which constrains the scope of the generalization claim. In the revision we will add a quantitative dataset-similarity analysis (MMD and selected feature-distribution distances) between all source-target pairs, report the resulting values alongside the transfer results, and revise the abstract and discussion to frame the observed generalization more narrowly as holding for these benchmark environments rather than claiming broad real-world applicability. revision: yes
Referee: [Results] Results section: the support for the 'strong dataset-model dependency' claim and average-rank comparisons is moderate because the manuscript provides no visible statistical tests, exact hyperparameter configurations, or full result tables, making it difficult to assess the robustness and reproducibility of the reported performance differences and ties.
Authors: We concur that statistical tests and complete reproducibility details are necessary. We will add Wilcoxon signed-rank tests (and paired tests where appropriate) to support the average-rank and performance-difference claims, include exact hyperparameter grids and selected configurations in an appendix, and release full per-run result tables (means, standard deviations, and all individual scores) as supplementary material. revision: yes
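As a rough illustration of the paired testing the authors commit to, here is an exact sign-flip permutation test on per-dataset score differences; it is a dependency-free stand-in for the Wilcoxon signed-rank test they name, and the numbers are hypothetical.

```python
from itertools import product

def sign_flip_p_value(diffs):
    """Exact two-sided sign-flip permutation test on paired score differences.

    Enumerates all 2^n sign assignments; the p-value is the fraction whose
    summed difference is at least as extreme in magnitude as the observed one.
    """
    observed = abs(sum(diffs))
    count = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            count += 1
    return count / total

# Hypothetical per-dataset score differences (supervised minus unsupervised).
diffs = [0.12, 0.08, 0.15, 0.10, 0.09, 0.11]
print(sign_flip_p_value(diffs))
```

With only a handful of benchmark datasets, exact enumeration is cheap (2^6 = 64 assignments here), which is one reason paired nonparametric tests suit this kind of comparison.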
Circularity Check
No circularity: purely empirical comparative evaluation
Full rationale
The manuscript reports experimental results from training and evaluating multiple tabular representation learning methods (TabICL, autoencoders, transformers) on CIDDS and similar NetFlow benchmarks, using both supervised classifiers and unsupervised detectors plus cross-dataset transfer tests. All performance claims rest on direct measurements after hyperparameter search; no equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central generalization statement is an empirical observation from the transfer experiments themselves, not a reduction to prior inputs by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- model-specific hyperparameters
axioms (1)
- Domain assumption: benchmark NetFlow datasets are representative of real network traffic distributions.
Lean theorems connected to this paper
- Foundation/AlphaCoordinateFixation.lean: alpha_pin_under_high_calibration (tagged unclear)
  RS pins constants with zero adjustable parameters, whereas this paper depends on extensive hyperparameter tuning; the two are methodologically opposite but domain-disjoint.
  The relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "Multiple hyperparameter configurations are explored for each model via grid search on validation sets."
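The validation-set grid search mentioned in the linked passage can be sketched as follows; the hyperparameter names, grid values, and scoring function are hypothetical stand-ins, not the paper's actual search space.

```python
from itertools import product

# Hypothetical hyperparameter grid for one representation-learning method.
grid = {"learning_rate": [1e-3, 1e-4], "embedding_dim": [16, 32, 64]}

def grid_search(grid, validation_score):
    """Return the configuration with the best validation score over the full grid."""
    keys = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        score = validation_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Stand-in scorer; in practice this trains the model and scores the validation set.
toy_score = lambda c: c["embedding_dim"] / 64 - c["learning_rate"]
best, score = grid_search(grid, toy_score)
print(best)
```

Because every method-classifier-dataset triple gets its own search of this form, the tuning budget grows multiplicatively, which is the methodological contrast the tag above notes.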
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
- [2] Cisco Systems, "Cisco IOS NetFlow - Cisco," https://www.cisco.com/c/en/us/products/ios-nx-os-software/ios-netflow/index.html, 2018.
- [3] C. Cao, A. Panichella, S. Verwer, A. Blaise, and F. Rebecchi, "Encode: Encoding netflows for network anomaly detection," arXiv preprint arXiv:2207.03890, 2022.
- [4] S. Bagui and K. Li, "Resampling imbalanced data for network intrusion detection datasets," Journal of Big Data, vol. 8, 2021.
- [5] M. Ring, D. Schlör, D. Landes, and A. Hotho, "Flow-based network traffic generation using generative adversarial networks," Computers & Security, vol. 82, pp. 156–172, 2019.
- [6] Z. Liang, "Efficient representations for high-cardinality categorical variables in machine learning," arXiv preprint arXiv:2501.05646, 2025.
- [7] J. Qu, D. Holzmüller, G. Varoquaux, and M. L. Morvan, "TabICL: A tabular foundation model for in-context learning on large data," arXiv preprint arXiv:2502.05564, 2025.
- [8] D. Bahri, H. Jiang, Y. Tay, and D. Metzler, "Scarf: Self-supervised contrastive learning using random feature corruption," arXiv preprint arXiv:2106.15147, 2021.
- [9] W. Cui, R. Hosseinzadeh, J. Ma, T. Wu, Y. Sui, and K. Golestan, "Tabular data contrastive learning via class-conditioned and feature-correlation based augmentation," arXiv preprint arXiv:2404.17489, 2024.
- [10] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, and A. Hotho, "Creation of flow-based data sets for intrusion detection," Journal of Information Warfare, vol. 16, no. 4, pp. 41–54, 2017.
- [11] J. C. Mondragon, P. Branco, G.-V. Jourdan, A. E. Gutierrez-Rodriguez, and R. R. Biswal, "Advanced IDS: a comparative study of datasets and machine learning algorithms for network flow-based intrusion detection systems," Applied Intelligence, vol. 55, no. 7, p. 608, 2025.
- [12] O. H. Abdulganiyu, T. Ait Tchakoucht, and Y. K. Saheed, "A systematic literature review for network intrusion detection system (IDS)," International Journal of Information Security, vol. 22, no. 5, pp. 1125–1162, 2023.
- [13] S. Gamage and J. Samarabandu, "Deep learning methods in network intrusion detection: A survey and an objective comparison," Journal of Network and Computer Applications, vol. 169, p. 102767, 2020.
- [14] E. Caville, W. W. Lo, S. Layeghy, and M. Portmann, "Anomal-E: A self-supervised network intrusion detection system based on graph neural networks," Knowledge-Based Systems, vol. 258, p. 110030, 2022.
- [15] J.-P. Jiang, S.-Y. Liu, H.-R. Cai, Q. Zhou, and H.-J. Ye, "Representation learning for tabular data: A comprehensive survey," arXiv preprint arXiv:2504.16109, 2025.
- [16] C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
- [17] X. Huang, A. Khetan, M. Cvitkovic, and Z. Karnin, "TabTransformer: Tabular data modeling using contextual embeddings," arXiv preprint arXiv:2012.06678, 2020.
- [18] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in International Conference on Machine Learning. PMLR, 2020, pp. 1597–1607.
- [19] P. H. Le-Khac, G. Healy, and A. F. Smeaton, "Contrastive representation learning: A framework and review," IEEE Access, vol. 8, pp. 193907–193934, 2020.
- [20] I. O. Lopes, D. Zou, I. H. Abdulqadder, F. A. Ruambo, B. Yuan, and H. Jin, "Effective network intrusion detection via representation learning: A denoising autoencoder approach," Computer Communications, vol. 194, pp. 55–65, 2022.
- [21] W. Wang, S. Jian, Y. Tan, Q. Wu, and C. Huang, "Representation learning-based network intrusion detection system by capturing explicit and implicit feature interactions," Computers & Security, vol. 112, p. 102537, 2022.
- [22] X. Zhang, J. Chen, Y. Zhou, L. Han, and J. Lin, "A multiple-layer representation learning model for network-based attack detection," IEEE Access, vol. 7, pp. 91992–92008, 2019.
- [23] M. Sarhan, S. Layeghy, and M. Portmann, "Towards a standard feature set for network intrusion detection system datasets," Mobile Networks and Applications, vol. 27, no. 1, pp. 357–370, 2022.
- [24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
- [25] P. Röchner, S. Klüttermann, F. Rothlauf, and D. Schlör, "We need to rethink benchmarking in anomaly detection," arXiv preprint arXiv:2507.15584, 2025.
- [26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
- [27] M. Wolf, D. Landes, A. Hotho, and D. Schlör, "Systematic evaluation of synthetic data augmentation for multi-class netflow traffic," arXiv preprint arXiv:2408.16034, 2024.
- [28] J. Tritscher, M. Wolf, A. Hotho, and D. Schlör, "Evaluating feature relevance XAI in network intrusion detection," in World Conference on Explainable Artificial Intelligence. Springer, 2023, pp. 483–497.