Assessing Generalisation Capability of Machine Learning Models for Intrusion Detection
Pith reviewed 2026-05-08 18:13 UTC · model grok-4.3
The pith
Supervised machine learning models for intrusion detection show strong performance on one dataset but drop below 40 percent accuracy on another.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that machine-learning-based intrusion detection exhibits a significant generalization gap: models achieve 95 percent or higher accuracy within a single dataset such as UNSW-NB15 or TON_IoT yet fall below 40 percent accuracy in cross-dataset testing, demonstrating that high in-domain performance does not ensure reliable detection in unseen network environments.
What carries the argument
Cross-dataset evaluation protocol that trains Random Forest, Logistic Regression, or Naive Bayes on one of the two intrusion datasets and tests on the other.
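As a rough illustration of this protocol, the sketch below trains a small Gaussian Naive Bayes classifier on one synthetic domain, then scores it on a held-out split of the same domain and on a second, shifted domain. Everything here is an assumption made for illustration: the data are synthetic, the domain shift is a constant offset on every feature, and the pure-Python classifier only stands in for the Random Forest, Logistic Regression, and Naive Bayes models the paper evaluates. It is not the authors' implementation.

```python
import math
import random

def make_domain(rng, n, offset, attack_rate):
    """Synthetic two-feature flows; `offset` shifts every feature (a crude domain shift)."""
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < attack_rate else 0
        centre = offset + (3.0 if label else 0.0)
        X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        y.append(label)
    return X, y

def fit_gnb(X, y):
    """Fit Gaussian Naive Bayes: per-class log prior, per-feature mean and variance."""
    model = {}
    for c in set(y):
        rows = [x for x, t in zip(X, y) if t == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[c] = (math.log(len(rows) / len(y)), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for one sample."""
    def log_post(c):
        prior, means, variances = model[c]
        return prior + sum(-0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
                           for v, m, s in zip(x, means, variances))
    return max(model, key=log_post)

def accuracy(model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)

def split(X, y, frac=0.7):
    """Simple ordered train/test split (samples are already i.i.d. here)."""
    k = int(len(X) * frac)
    return (X[:k], y[:k]), (X[k:], y[k:])

rng = random.Random(0)
domain_a = make_domain(rng, 2000, offset=0.0, attack_rate=0.3)  # stand-in for dataset A
domain_b = make_domain(rng, 2000, offset=4.0, attack_rate=0.3)  # stand-in for dataset B
(a_train, a_test), (b_train, b_test) = split(*domain_a), split(*domain_b)

model_a = fit_gnb(*a_train)
same = accuracy(model_a, *a_test)    # same-dataset setting
cross = accuracy(model_a, *b_test)   # cross-dataset setting (train on A, test on B)
print(f"same-dataset accuracy:  {same:.3f}")
print(f"cross-dataset accuracy: {cross:.3f}")
```

In this toy setting the cross-domain score should fall toward the attack rate of the shifted domain, because nearly every shifted flow lands on the attack side of the learned boundary. The reverse direction (train on B, test on A) would be evaluated the same way with `fit_gnb(*b_train)`.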
If this is right
- Single-dataset benchmarks are insufficient to certify real-world reliability of intrusion detection models.
- Intrusion detection systems need explicit mechanisms for handling distribution shifts between network environments.
- Adaptive or domain-robust training methods are required for models to operate across changing IoT and networked systems.
- The same generalization challenge appears in other anomaly-detection settings that rely on behavioral signals.
Where Pith is reading between the lines
- Practical systems would probably need continuous retraining or transfer-learning steps when the network environment changes.
- The gap may be reduced by collecting training data from multiple distinct network sources rather than relying on one benchmark.
- Feature-level analysis of the two datasets could identify which traffic characteristics drive the performance collapse.
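One possible shape for the feature-level analysis in the last point is a per-feature two-sample Kolmogorov-Smirnov statistic, which measures the largest gap between the empirical distributions of a feature in the two datasets. The sketch below is a minimal, hedged illustration: the feature names and values are made up, and the paper does not report any such analysis.

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for v in a + b:
        fa = bisect.bisect_right(a, v) / len(a)
        fb = bisect.bisect_right(b, v) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap

def rank_shift(source, target):
    """Rank the features shared by both datasets from most to least shifted."""
    shared = source.keys() & target.keys()
    scores = {f: ks_statistic(source[f], target[f]) for f in shared}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical feature columns (illustrative values, not taken from either dataset).
source = {"duration": [0.1, 0.2, 0.3, 0.4], "src_bytes": [100, 200, 300, 400]}
target = {"duration": [0.1, 0.2, 0.3, 0.5], "src_bytes": [5000, 6000, 7000, 8000]}

ranking = rank_shift(source, target)
for name, score in ranking:
    print(f"{name}: KS = {score:.2f}")
```

Features near the top of the ranking are the first candidates to inspect when explaining a cross-dataset collapse, although a large marginal shift alone does not prove that a feature drives the errors.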
Load-bearing premise
The performance drop is caused by the models failing to generalize rather than by differences in how the two datasets were collected, labeled, or preprocessed.
What would settle it
Re-running the cross-dataset tests after forcing identical feature extraction, normalization, and label schemes on both UNSW-NB15 and TON_IoT and still observing accuracy below 40 percent would support the generalization-gap claim; the opposite result would indicate dataset artifacts instead.
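A minimal sketch of what "identical feature extraction, normalization, and label schemes" could look like in practice follows. The column names, the mapping onto a shared schema, and the sample records are all assumptions chosen for illustration; the paper does not describe its alignment procedure. The key design choice shown is that scaling statistics are fitted on the source dataset only and then reused unchanged on the target.

```python
import statistics

# Hypothetical mapping of raw column names onto a shared schema (assumed, not from the paper).
UNSW_MAP = {"dur": "duration", "sbytes": "src_bytes", "dbytes": "dst_bytes"}
TON_MAP = {"duration": "duration", "src_bytes": "src_bytes", "dst_bytes": "dst_bytes"}
SHARED = ["duration", "src_bytes", "dst_bytes"]

def align(records, column_map, label_key, benign_value):
    """Keep only the shared features and collapse labels to 0 (benign) or 1 (attack)."""
    inverse = {shared: raw for raw, shared in column_map.items()}
    X = [[float(r[inverse[f]]) for f in SHARED] for r in records]
    y = [0 if r[label_key] == benign_value else 1 for r in records]
    return X, y

def fit_scaler(X):
    """Per-feature mean and standard deviation, computed on the training data only."""
    return [(statistics.fmean(c), statistics.pstdev(c) or 1.0) for c in zip(*X)]

def transform(X, scaler):
    """Standardize rows using previously fitted statistics."""
    return [[(v - m) / s for v, (m, s) in zip(row, scaler)] for row in X]

# Illustrative records; dataset-specific fields (sttl, dns_query) are dropped by align().
unsw = [
    {"dur": 0.5, "sbytes": 200, "dbytes": 100, "attack_cat": "Normal", "sttl": 31},
    {"dur": 1.5, "sbytes": 400, "dbytes": 300, "attack_cat": "Exploits", "sttl": 254},
]
ton = [
    {"duration": 1.0, "src_bytes": 300, "dst_bytes": 200, "type": "normal", "dns_query": "-"},
    {"duration": 2.0, "src_bytes": 500, "dst_bytes": 400, "type": "ddos", "dns_query": "-"},
]

X_src, y_src = align(unsw, UNSW_MAP, "attack_cat", "Normal")
X_tgt, y_tgt = align(ton, TON_MAP, "type", "normal")
scaler = fit_scaler(X_src)            # fitted on the source dataset only
X_src_z = transform(X_src, scaler)
X_tgt_z = transform(X_tgt, scaler)    # target reuses the source statistics
print(X_src_z, y_src)
print(X_tgt_z, y_tgt)
```

If accuracy stays low after both datasets pass through one pipeline of this kind, incompatibility of the input spaces becomes a less likely explanation for the drop.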
Original abstract
The growth of networked and IoT systems has intensified cyber-security threats and exposed the limits of traditional signature-based intrusion detection. Although machine-learning-based intrusion detection systems often report strong benchmark performance, high accuracy within a single dataset does not necessarily guarantee reliable performance in unseen network environments. This study investigates the generalisation capability of supervised machine learning models for intrusion detection using UNSW-NB15 and TON_IoT. Random Forest, Logistic Regression, and Naive Bayes were evaluated under same-dataset and cross-dataset settings. Random Forest achieved the strongest same dataset performance, with 95.08% accuracy on UNSW-NB15 and 99.79% on TON_IoT, but performance dropped sharply in cross-dataset testing. When trained on UNSW-NB15 and tested on TON_IoT or vice versa, below 40% accuracy. These results reveal a significant generalisation gap in intrusion detection. We connect this challenge to affective computing and human-centric AI, where behavioural signal analysis, anomaly detection, domain shift, and context-sensitive modelling are also central. This framing highlights the need for adaptive, generalisable cyber-security models that can operate across changing network and IoT environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates Random Forest, Logistic Regression, and Naive Bayes on UNSW-NB15 and TON_IoT for intrusion detection. It reports high same-dataset accuracies (e.g., 95.08% and 99.79% for Random Forest) but sharp drops to below 40% in cross-dataset testing, claiming this demonstrates a significant generalization gap in ML-based IDS and linking the issue to domain shift in affective computing and human-centric AI.
Significance. If the cross-dataset performance collapse is shown to arise from inability to generalize to new network environments rather than dataset-specific artifacts, the result would usefully highlight limitations of current supervised ML approaches for intrusion detection and motivate work on domain-adaptive or context-sensitive models. The purely empirical nature and direct measurement of the gap are strengths, but the absence of supporting experimental details limits the strength of the claim.
major comments (1)
- [Abstract] Abstract and methods (inferred from lack of description): the central claim that the <40% cross-dataset accuracy measures generalization failure to unseen network conditions is undermined by the absence of any description of feature preprocessing, categorical encoding, scaling, feature selection, or explicit alignment steps between UNSW-NB15 and TON_IoT. These datasets differ in collection context, feature definitions, attack taxonomies, and label distributions; without commensurate input spaces, the performance drop can be produced by incompatibility alone rather than true domain shift.
minor comments (2)
- [Abstract] Abstract: the sentence 'When trained on UNSW-NB15 and tested on TON_IoT or vice versa, below 40% accuracy.' is grammatically incomplete and should be rephrased for clarity.
- [Abstract] The connection drawn to affective computing and human-centric AI in the abstract and conclusion appears tangential and is not developed with any concrete mapping or shared methodology; this framing may distract from the core empirical contribution.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comment below and will revise the manuscript to incorporate the requested details.
- Referee: [Abstract] Abstract and methods (inferred from lack of description): the central claim that the <40% cross-dataset accuracy measures generalization failure to unseen network conditions is undermined by the absence of any description of feature preprocessing, categorical encoding, scaling, feature selection, or explicit alignment steps between UNSW-NB15 and TON_IoT. These datasets differ in collection context, feature definitions, attack taxonomies, and label distributions; without commensurate input spaces, the performance drop can be produced by incompatibility alone rather than true domain shift.
Authors: We agree that the current manuscript lacks sufficient detail on these methodological aspects, which limits the strength of the generalization claim. In the revised version we will add an explicit Methods subsection describing the full preprocessing pipeline: categorical encoding (one-hot for nominal features with consistent category mapping across datasets), feature scaling (standardization applied after alignment), feature selection (retaining only overlapping features present in both UNSW-NB15 and TON_IoT), and the alignment procedure (manual mapping of common network-flow attributes while discarding dataset-specific fields and harmonizing attack labels to a shared taxonomy subset). These steps were performed to produce commensurate input spaces; the remaining performance collapse is therefore attributable to domain shift. We will also report the exact number of retained features and any label-distribution adjustments. This revision directly addresses the concern without changing the reported accuracy figures.
Revision: yes
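The "consistent category mapping across datasets" mentioned in the response could, under one reading, mean fitting a fixed vocabulary on the training dataset and encoding the test dataset against that same vocabulary. The sketch below illustrates that reading only; the protocol values are hypothetical, and sending unseen categories to an all-zero vector is an assumption, not something the authors state.

```python
def fit_vocab(values):
    """Fixed, sorted category list learned from the training dataset only."""
    return sorted(set(values))

def one_hot(value, vocab):
    """Encode against the fixed vocabulary; unseen categories become the all-zero vector."""
    return [1 if value == category else 0 for category in vocab]

# Hypothetical protocol columns from a training and a test dataset.
train_proto = ["tcp", "udp", "tcp", "icmp"]
test_proto = ["udp", "tcp", "sctp"]  # "sctp" never appears in training

vocab = fit_vocab(train_proto)
encoded = [one_hot(p, vocab) for p in test_proto]
print(vocab)
print(encoded)
```

Because the column order comes from one vocabulary, a model trained on the first dataset sees each category in the same position when scored on the second.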
Circularity Check
No circularity; purely empirical measurements with no derivations or self-referential loops
The paper consists entirely of direct empirical evaluation: training Random Forest, Logistic Regression, and Naive Bayes on UNSW-NB15 and TON_IoT, then reporting accuracy numbers for same-dataset and cross-dataset splits. No equations, fitted parameters, or theoretical derivations are present that could reduce to their own inputs by construction. The central claim of a generalization gap is simply the observed performance drop (below 40% cross-dataset), which is a raw measurement rather than a prediction derived from prior fits or self-citations. No load-bearing self-citations, ansatzes, or uniqueness theorems appear in the provided text. This is a standard empirical ML benchmarking study whose results stand or fall on the reported numbers and experimental setup, with no internal circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Labeled training data from each dataset accurately reflects the intrusion patterns present in that network environment.
- domain assumption: The two datasets (UNSW-NB15 and TON_IoT) differ sufficiently in distribution to serve as proxies for unseen environments.