arxiv: 2605.04407 · v1 · submitted 2026-05-06 · 💻 cs.CR

Assessing Generalisation Capability of Machine Learning Models for Intrusion Detection

Pith reviewed 2026-05-08 18:13 UTC · model grok-4.3

classification 💻 cs.CR
keywords intrusion detection · machine learning · generalization · UNSW-NB15 · TON_IoT · cross-dataset evaluation · cyber security · supervised learning

The pith

Supervised machine learning models for intrusion detection perform strongly when tested on the dataset they were trained on, but drop below 40 percent accuracy when tested on a different dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three supervised models on two standard intrusion detection datasets to check whether good benchmark results hold up in new network settings. Random Forest, Logistic Regression, and Naive Bayes all reach high accuracy when trained and tested inside the same dataset. Performance collapses when the models are trained on one dataset and evaluated on the other. The results point to a clear generalization problem that limits reliable use across varying IoT and networked environments. The authors also note parallels with anomaly detection challenges in affective computing.

Core claim

The central claim is that machine-learning-based intrusion detection exhibits a significant generalization gap: models achieve 95 percent or higher accuracy within a single dataset such as UNSW-NB15 or TON_IoT yet fall below 40 percent accuracy in cross-dataset testing, demonstrating that high in-domain performance does not ensure reliable detection in unseen network environments.

What carries the argument

A cross-dataset evaluation protocol: each model (Random Forest, Logistic Regression, or Naive Bayes) is trained on one of the two intrusion datasets and tested on the other.
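The protocol can be sketched in a few lines. This is a minimal illustration, not the authors' code: the two "datasets" are synthetic Gaussian clusters with a deliberately exaggerated shift between them, a nearest-centroid classifier stands in for the paper's three models, and the 70/30 split is an assumption.

```python
import random

def make_dataset(rng, n, benign_mean, attack_mean):
    """Synthetic stand-in for a labelled flow dataset: (features, label) rows."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 0 = benign, 1 = attack
        centre = attack_mean if label else benign_mean
        rows.append(([rng.gauss(m, 1.0) for m in centre], label))
    return rows

def fit_centroids(rows):
    """Nearest-centroid 'training': one mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def accuracy(centroids, rows):
    """Fraction of rows whose nearest class centroid matches the true label."""
    def predict(x):
        return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))
    return sum(predict(x) == y for x, y in rows) / len(rows)

rng = random.Random(0)
# Domain B places benign and attack traffic in different regions than domain A.
data = {
    "A": make_dataset(rng, 400, benign_mean=(0.0, 0.0), attack_mean=(5.0, 5.0)),
    "B": make_dataset(rng, 400, benign_mean=(6.0, 6.0), attack_mean=(-1.0, 1.0)),
}

results = {}
for train_name, train_rows in data.items():
    model = fit_centroids(train_rows[:280])  # first 70% used for training
    for test_name, test_rows in data.items():
        # Same-dataset: held-out 30%. Cross-dataset: the entire other dataset.
        held_out = test_rows[280:] if test_name == train_name else test_rows
        results[(train_name, test_name)] = accuracy(model, held_out)

for (train_name, test_name), acc in sorted(results.items()):
    print(f"train={train_name} test={test_name} accuracy={acc:.2f}")
```

The diagonal entries of `results` correspond to the same-dataset setting and the off-diagonal entries to the cross-dataset setting; the gap between them is the quantity the paper reports.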

If this is right

  • Single-dataset benchmarks are insufficient to certify real-world reliability of intrusion detection models.
  • Intrusion detection systems need explicit mechanisms for handling distribution shifts between network environments.
  • Adaptive or domain-robust training methods are required for models to operate across changing IoT and networked systems.
  • The same generalization challenge appears in other anomaly-detection settings that rely on behavioral signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practical systems would probably need continuous retraining or transfer-learning steps when the network environment changes.
  • The gap may be reduced by collecting training data from multiple distinct network sources rather than relying on one benchmark.
  • Feature-level analysis of the two datasets could identify which traffic characteristics drive the performance collapse.
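As a first-pass version of the feature-level analysis suggested above, one could rank the features shared by both datasets by how far their distributions differ. The sketch below uses a standardized mean difference as the shift score; the feature names and values are hypothetical, and the score only measures marginal shift, so it indicates where to look rather than what causes the accuracy drop.

```python
from statistics import mean, pvariance

def shift_score(values_a, values_b):
    """Absolute difference in means, in units of the pooled standard deviation."""
    pooled_sd = ((pvariance(values_a) + pvariance(values_b)) / 2) ** 0.5
    gap = abs(mean(values_a) - mean(values_b))
    if pooled_sd == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / pooled_sd

def rank_shifted_features(columns_a, columns_b):
    """Rank features present in both datasets from most to least shifted."""
    shared = sorted(set(columns_a) & set(columns_b))
    scores = {name: shift_score(columns_a[name], columns_b[name]) for name in shared}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical per-feature value lists; not real UNSW-NB15 or TON_IoT columns.
dataset_a = {
    "duration": [1.0, 2.0, 3.0],
    "bytes": [100.0, 200.0, 300.0],
    "only_in_a": [0.0, 1.0, 0.0],  # dropped: no counterpart in dataset_b
}
dataset_b = {
    "duration": [1.5, 2.5, 3.5],
    "bytes": [1100.0, 1200.0, 1300.0],
}

ranking = rank_shifted_features(dataset_a, dataset_b)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```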

Load-bearing premise

The performance drop is caused by the models failing to generalize rather than by differences in how the two datasets were collected, labeled, or preprocessed.

What would settle it

Re-running the cross-dataset tests after forcing identical feature extraction, normalization, and label schemes on both UNSW-NB15 and TON_IoT and still observing accuracy below 40 percent would support the generalization-gap claim; the opposite result would indicate dataset artifacts instead.

Figures

Figures reproduced from arXiv: 2605.04407 by Md Ayshik Rahman Khan, Md Rafiqul Islam, Md Zakir Hossain, Syed Mohammed Shamsul Islam, Tom Gedeon.

Figure 1
Figure 1: Development model used in this study. … rate [13]. Taken together, these studies reinforce the view that effectiveness of intrusion detection systems depends on both the dataset and the attack context, and cross-dataset evaluation still remains limited, which makes generalization to real-world scenarios harder to assess. Although many studies report high accuracy on individual benchmark datasets, evaluation …
Figure 2
Figure 2: Feature correlation heatmap of the UNSW-NB15 dataset. … accuracy, 96.21% precision, 96.08% recall, and 96.15% F1-score. Logistic Regression also performed well, achieving 89.53% accuracy and the highest recall among the three models at 97.22%. Naive Bayes achieved the lowest accuracy at 81.47%, but its precision and F1-score indicate that it still provided a reasonable baseline performance on this dataset …
Figure 3
Figure 3: Top ten important features identified by the Random Forest model on the TON_IoT dataset. … DNS-related features also had strong influence on the model. Features such as dns_rejected, dns_RA, and dns_AA appeared among the top contributors, suggesting that DNS behaviour was useful for separating normal and malicious traffic in the TON_IoT dataset. In addition, connection-level and direction-based …
original abstract

The growth of networked and IoT systems has intensified cyber-security threats and exposed the limits of traditional signature-based intrusion detection. Although machine-learning-based intrusion detection systems often report strong benchmark performance, high accuracy within a single dataset does not necessarily guarantee reliable performance in unseen network environments. This study investigates the generalisation capability of supervised machine learning models for intrusion detection using UNSW-NB15 and TON_IoT. Random Forest, Logistic Regression, and Naive Bayes were evaluated under same-dataset and cross-dataset settings. Random Forest achieved the strongest same dataset performance, with 95.08% accuracy on UNSW-NB15 and 99.79% on TON_IoT, but performance dropped sharply in cross-dataset testing. When trained on UNSW-NB15 and tested on TON_IoT or vice versa, below 40% accuracy. These results reveal a significant generalisation gap in intrusion detection. We connect this challenge to affective computing and human-centric AI, where behavioural signal analysis, anomaly detection, domain shift, and context-sensitive modelling are also central. This framing highlights the need for adaptive, generalisable cyber-security models that can operate across changing network and IoT environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper evaluates Random Forest, Logistic Regression, and Naive Bayes on UNSW-NB15 and TON_IoT for intrusion detection. It reports high same-dataset accuracies (e.g., 95.08% and 99.79% for Random Forest) but sharp drops to below 40% in cross-dataset testing, claiming this demonstrates a significant generalization gap in ML-based IDS and linking the issue to domain shift in affective computing and human-centric AI.

Significance. If the cross-dataset performance collapse is shown to arise from inability to generalize to new network environments rather than dataset-specific artifacts, the result would usefully highlight limitations of current supervised ML approaches for intrusion detection and motivate work on domain-adaptive or context-sensitive models. The purely empirical nature and direct measurement of the gap are strengths, but the absence of supporting experimental details limits the strength of the claim.

major comments (1)
  1. [Abstract] Abstract and methods (inferred from lack of description): the central claim that the <40% cross-dataset accuracy measures generalization failure to unseen network conditions is undermined by the absence of any description of feature preprocessing, categorical encoding, scaling, feature selection, or explicit alignment steps between UNSW-NB15 and TON_IoT. These datasets differ in collection context, feature definitions, attack taxonomies, and label distributions; without commensurate input spaces, the performance drop can be produced by incompatibility alone rather than true domain shift.
minor comments (2)
  1. [Abstract] Abstract: the sentence 'When trained on UNSW-NB15 and tested on TON_IoT or vice versa, below 40% accuracy.' is grammatically incomplete and should be rephrased for clarity.
  2. [Abstract] The connection drawn to affective computing and human-centric AI in the abstract and conclusion appears tangential and is not developed with any concrete mapping or shared methodology; this framing may distract from the core empirical contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below and will revise the manuscript to incorporate the requested details.

point-by-point responses
  1. Referee: [Abstract] Abstract and methods (inferred from lack of description): the central claim that the <40% cross-dataset accuracy measures generalization failure to unseen network conditions is undermined by the absence of any description of feature preprocessing, categorical encoding, scaling, feature selection, or explicit alignment steps between UNSW-NB15 and TON_IoT. These datasets differ in collection context, feature definitions, attack taxonomies, and label distributions; without commensurate input spaces, the performance drop can be produced by incompatibility alone rather than true domain shift.

    Authors: We agree that the current manuscript lacks sufficient detail on these methodological aspects, which limits the strength of the generalization claim. In the revised version we will add an explicit Methods subsection describing the full preprocessing pipeline: categorical encoding (one-hot for nominal features with consistent category mapping across datasets), feature scaling (standardization applied after alignment), feature selection (retaining only overlapping features present in both UNSW-NB15 and TON_IoT), and the alignment procedure (manual mapping of common network-flow attributes while discarding dataset-specific fields and harmonizing attack labels to a shared taxonomy subset). These steps were performed to produce commensurate input spaces; the remaining performance collapse is therefore attributable to domain shift. We will also report the exact number of retained features and any label-distribution adjustments. This revision directly addresses the concern without changing the reported accuracy figures.

    revision: yes
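The alignment steps named in the rebuttal (overlapping features only, one-hot encoding with a category mapping shared across datasets, standardization after alignment) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the field names are hypothetical, and fitting the scaling statistics on the training dataset only is a choice made here that the source does not specify.

```python
from statistics import mean, pstdev

def shared_columns(rows_a, rows_b):
    """Columns present in every record of both datasets, in sorted order."""
    cols_a = set.intersection(*(set(r) for r in rows_a))
    cols_b = set.intersection(*(set(r) for r in rows_b))
    return sorted(cols_a & cols_b)

def fit_encoder(train_rows, all_rows, numeric, categorical):
    """Scaling statistics from train_rows only; category vocabulary from all rows."""
    stats = {}
    for col in numeric:
        values = [float(r[col]) for r in train_rows]
        stats[col] = (mean(values), pstdev(values) or 1.0)  # guard against zero spread
    vocab = {col: sorted({r[col] for r in all_rows}) for col in categorical}
    return stats, vocab

def encode(row, stats, vocab):
    """Standardized numeric values followed by one-hot categorical indicators."""
    vector = [(float(row[col]) - mu) / sd for col, (mu, sd) in stats.items()]
    for col, categories in vocab.items():
        vector.extend(1.0 if row[col] == cat else 0.0 for cat in categories)
    return vector

# Hypothetical records; the field names are illustrative, not real dataset columns.
source_rows = [
    {"proto": "tcp", "duration": 1.0, "only_in_source": 7},
    {"proto": "udp", "duration": 3.0, "only_in_source": 9},
]
target_rows = [
    {"proto": "tcp", "duration": 2.0, "only_in_target": "x"},
    {"proto": "icmp", "duration": 4.0, "only_in_target": "y"},
]

shared = shared_columns(source_rows, target_rows)
stats, vocab = fit_encoder(
    source_rows, source_rows + target_rows, numeric=["duration"], categorical=["proto"]
)
encoded_target = [encode(row, stats, vocab) for row in target_rows]
print(shared)
print(encoded_target)
```

Because the category vocabulary is built from both datasets, a protocol value seen only in the target data (`icmp` here) still gets its own column, so source and target vectors have the same length and column meaning.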

Circularity Check

0 steps flagged

No circularity; purely empirical measurements with no derivations or self-referential loops

full rationale

The paper consists entirely of direct empirical evaluation: training Random Forest, Logistic Regression, and Naive Bayes on UNSW-NB15 and TON_IoT, then reporting accuracy numbers for same-dataset and cross-dataset splits. No equations, fitted parameters, or theoretical derivations are present that could reduce to their own inputs by construction. The central claim of a generalization gap is simply the observed performance drop (below 40% cross-dataset), which is a raw measurement rather than a prediction derived from prior fits or self-citations. No load-bearing self-citations, ansatzes, or uniqueness theorems appear in the provided text. This is a standard empirical ML benchmarking study whose results stand or fall on the reported numbers and experimental setup, with no internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions and the representativeness of the two chosen datasets; no new entities or fitted constants are introduced.

axioms (2)
  • domain assumption: Labeled training data from each dataset accurately reflects the intrusion patterns present in that network environment.
    Invoked when interpreting same-dataset high accuracy as model capability and cross-dataset drop as generalization failure.
  • domain assumption: The two datasets (UNSW-NB15 and TON_IoT) differ sufficiently in distribution to serve as proxies for unseen environments.
    Central to the cross-dataset testing design described in the abstract.

Reference graph

Works this paper leans on

25 extracted references

  1. [1]

    https://www.ibm.com/reports/data-breach

  2. [2]

    Adek, R.T., Ula, M.: A survey on the accuracy of machine learning techniques for intrusion and anomaly detection on public data sets. In: 2020 International Conference on Data Science, Artificial Intelligence, and Business Analytics (DATABIA). pp. 19–27. IEEE (2020)

  3. [3]

    Ajagbe, S.A., Awotunde, J.B., Florez, H.: Intrusion detection: a comparison study of machine learning models using unbalanced dataset. SN Computer Science 5(8), 1028 (2024)

  4. [4]

    Aljanabi, M., Ismail, M.A., Hasan, R.A., Sulaiman, J.: Intrusion detection: A review. Mesopotamian Journal of CyberSecurity 2021, 1–4 (2021)

  5. [5]

    Almotairi, A., Atawneh, S., Khashan, O.A., Khafajah, N.M.: Enhancing intrusion detection in iot networks using machine learning-based feature selection and ensemble models. Systems Science & Control Engineering 12(1), 2321381 (2024)

  6. [6]

    Almseidin, M., Alzubi, M., Kovacs, S., Alkasassbeh, M.: Evaluation of machine learning algorithms for intrusion detection system. In: 2017 IEEE 15th international symposium on intelligent systems and informatics (SISY). pp. 000277–000282. IEEE (2017)

  7. [7]

    Axelsson, S.: Intrusion detection systems: A survey and taxonomy (2000)

  8. [8]

    Azeroual, H., Belghiti, I.D., Berbiche, N.: Analysis of unsw-nb15 datasets using machine learning algorithms. In: International Conference on Digital Technologies and Applications. pp. 199–209. Springer (2022)

  9. [9]

    Dharini, N., Janani, V., Katiravan, J.: Efficient detection of intrusions in ton-iot dataset using hybrid feature selection approach. Scientific Reports (2026)

  10. [10]

    Gad, A.R., Nashat, A.A., Barkat, T.M.: Intrusion detection system using machine learning for vehicular ad hoc networks based on ton-iot dataset. IEEE Access 9, 142206–142217 (2021)

  11. [11]

    Gümüşbaş, D., Yıldırım, T., Genovese, A., Scotti, F.: A comprehensive survey of databases and deep learning methods for cybersecurity and intrusion detection systems. IEEE Systems Journal 15(2), 1717–1731 (2020)

  12. [12]

    Janarthanan, T., Zargari, S.: Feature selection in unsw-nb15 and kddcup’99 datasets. In: 2017 IEEE 26th international symposium on industrial electronics (ISIE). pp. 1881–1886. IEEE (2017)

  13. [13]

    Kanimozhi, V., Jacob, T.P.: Calibration of various optimized machine learning classifiers in network intrusion detection system on the realistic cyber dataset cse-cic-ids2018 using cloud computing. International Journal of Engineering Applied Sciences and Technology 4(6), 2455–2143 (2019)

  14. [14]

    Kenyon, A., Deka, L., Elizondo, D.: Are public intrusion datasets fit for purpose characterising the state of the art in intrusion event datasets. Computers & Security 99, 102022 (2020)

  15. [15]

    Khan, M.I., Arif, A., Khan, A.R.A.: Ai-driven threat detection: A brief overview of ai techniques in cybersecurity. BIN: Bulletin of Informatics 2(2), 248–61 (2024)

  16. [16]

    Li, Z., Qin, Z., Huang, K., Yang, X., Ye, S.: Intrusion detection using convolutional neural networks for representation learning. In: International conference on neural information processing. pp. 858–866. Springer (2017)

  17. [17]

    Lopez-Martin, M., Carro, B., Sanchez-Esguevillas, A., Lloret, J.: Shallow neural network with kernel approximation for prediction problems in highly demanding data networks. Expert Systems with Applications 124, 196–208 (2019)

  18. [18]

    Lu, H., Liu, J., Peng, J., Lu, J.: Adversarial attacks based on time-series features for traffic detection. Computers & Security 148, 104175 (2025)

  19. [19]

    Maseer, Z.K., Yusof, R., Bahaman, N., Mostafa, S.A., Foozy, C.F.M.: Benchmarking of machine learning for anomaly based intrusion detection systems in the cicids2017 dataset. IEEE Access 9, 22351–22370 (2021)

  20. [20]

    Moustafa, N.: A new distributed architecture for evaluating ai-based security systems at the edge: Network ton_iot datasets. Sustain. Cities Soc. 72, 102994 (2021)

  21. [21]

    Moustafa, N., Slay, J.: Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS). pp. 1–6. IEEE (2015)

  22. [22]

    Moustafa, N., Slay, J.: The evaluation of network anomaly detection systems: Statistical analysis of the unsw-nb15 data set and the comparison with the kdd99 data set. Information Security Journal: A Global Perspective 25(1-3), 18–31 (2016)

  23. [23]

    Panwar, S.S., Raiwani, Y., Panwar, L.S.: An intrusion detection model for cicids-2017 dataset using machine learning algorithms. In: 2022 International Conference on Advances in Computing, Communication and Materials (ICACCM). pp. 1–10. IEEE (2022)

  24. [24]

    Saranya, T., Sridevi, S., Deisy, C., Chung, T.D., Khan, M.A.: Performance analysis of machine learning algorithms in intrusion detection system: A review. Procedia Computer Science 171, 1251–1260 (2020)

  25. [25]

    Zhou, Y., Han, M., Liu, L., He, J.S., Wang, Y.: Deep learning approach for cyber-attack detection. In: IEEE INFOCOM 2018-IEEE conference on computer communications workshops (INFOCOM WKSHPS). pp. 262–267. IEEE (2018)