On Designing Machine Learning Models for Malicious Network Traffic Classification

Alina Oprea; Simona Boboila; Talha Ongun; Timothy Sakharaov; Tina Eliassi-Rad

arXiv: 1907.04846 · v1 · pith:XSCAQH5Snew · submitted 2019-07-10 · 💻 cs.CR · cs.LG · stat.ML

Pith reviewed 2026-05-24 23:41 UTC · model grok-4.3

keywords: machine learning, botnet detection, network traffic classification, cyber security, feature representation, ensemble models, ground truth granularity, class imbalance

The pith

Machine learning models for botnet detection from network traffic improve when features reflect attack characteristics, ensembles address imbalance, and ground truth labels are sufficiently detailed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out concrete guidelines for applying supervised machine learning to classify malicious network traffic, taking botnet detection as its running example. It shows that feature sets perform better when they are built around the specific behaviors of the attacks being targeted. Ensemble classifiers prove particularly effective at managing the severe class imbalance typical of network data. The level of detail used when labeling the training examples (the ground-truth granularity) also has a large effect on final accuracy.

Core claim

In the botnet detection case study, supervised machine learning succeeds when feature representations incorporate attack characteristics, when ensemble models are chosen to cope with class imbalance, and when the granularity of the ground truth labels is chosen with care.

What carries the argument

The botnet detection case study on network traffic data, used to test variations in feature design, model choice, and label granularity.

If this is right

Attack-specific feature engineering raises detection performance on botnet traffic.
Ensemble methods reduce the impact of class imbalance in network traffic datasets.
Coarser or finer ground truth labels can change measured accuracy by a noticeable margin.
Public benchmark datasets would make these design choices easier to compare across studies.
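To make the class-imbalance point concrete, the following minimal sketch shows why plain accuracy is a poor yardstick on imbalanced traffic data. The 990-to-10 split, the label encoding (1 = malicious, 0 = benign), and the function names are illustrative assumptions, not figures or code from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means malicious."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, and recall; undefined ratios become 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical dataset: 990 benign connections and 10 malicious ones.
y_true = [0] * 990 + [1] * 10

# A degenerate "classifier" that labels every connection benign.
always_benign = [0] * len(y_true)

report = summarize(y_true, always_benign)
print(report)
```

The degenerate classifier scores high accuracy while detecting no malicious connections at all, which is why precision and recall are the more informative metrics when malicious examples are rare.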

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three design rules could be checked on intrusion detection or malware traffic tasks to see whether they travel.
Streaming or online versions of these models would need additional handling for concept drift that the static case study does not address.
Real deployments might also need to weigh the computational cost of ensembles against their accuracy gains on imbalanced data.

Load-bearing premise

The results obtained on this particular botnet detection task will carry over to other kinds of malicious network traffic classification.

What would settle it

A follow-up experiment that applies the same models and features to several different malicious traffic types and finds no advantage for attack-aware features or ensembles would undermine the guidelines.

Figures

Figures reproduced from arXiv: 1907.04846 by Alina Oprea, Simona Boboila, Talha Ongun, Timothy Sakharaov, Tina Eliassi-Rad.

**Figure 1.** Fields in Bro connection log. • Can raw network data be used effectively in an ML algorithm? • Which feature representations are most appropriate for applying ML classification algorithms? • Which classifiers achieve best performance in handling the largely imbalanced cyber-security datasets? • What is the impact of labeling the data for ground truth generation? We assume that the monitoring agent, which …

**Figure 2.** Overview of the system architecture. … scenario available and that precluded the use of supervised ML. In traditional ML, cross-validation is a well-known method to evaluate the generalization of a model. k-fold cross-validation splits the data into k partitions at random, trains a model on k − 1 of them and evaluates it on the k-th partition. Splitting the logs at random produces highly-correlated data betw…

**Figure 4.** Precision-recall curves for three classifiers for Neris.
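The k-fold cross-validation procedure quoted in the Figure 2 excerpt (random split into k partitions, train on k − 1, evaluate on the remaining one) can be sketched as follows. This is a generic illustration of the textbook procedure; the function name `k_fold_indices` and its parameters are hypothetical and do not come from the paper.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for plain random k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle with a local, seeded generator so the partition is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train_idx, test_idx


# Usage: 10 samples, 5 folds; each fold holds out 2 samples and trains on 8.
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5)):
    print(fold, sorted(test_idx), len(train_idx))
```

The same excerpt cautions that splitting logs at random produces highly correlated data, so this sketch only shows the baseline procedure being discussed, not the evaluation design the paper ultimately uses.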

Original abstract

Machine learning (ML) started to become widely deployed in cyber security settings for shortening the detection cycle of cyber attacks. To date, most ML-based systems are either proprietary or make specific choices of feature representations and machine learning models. The success of these techniques is difficult to assess as public benchmark datasets are currently unavailable. In this paper, we provide concrete guidelines and recommendations for using supervised ML in cyber security. As a case study, we consider the problem of botnet detection from network traffic data. Among our findings we highlight that: (1) feature representations should take into consideration attack characteristics; (2) ensemble models are well-suited to handle class imbalance; (3) the granularity of ground truth plays an important role in the success of these methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Botnet case study yields three practical guidelines but rests on single-attack data with no cross-attack validation shown.

The main thing to know is that the authors ran botnet detection experiments and extracted three concrete recommendations: design features with attack characteristics in mind, use ensembles to manage class imbalance, and pay attention to ground-truth granularity. They do a reasonable job of documenting their setup and showing how those choices affected outcomes in their case study, which addresses the lack of public benchmarks noted in the abstract. That part is useful for people who need rules of thumb rather than new theory.

The soft spot is the narrow scope. All three findings come from one botnet dataset and attack type. The stress-test note is on target here: without testing the same ideas on other malicious traffic patterns such as scanning or exfiltration, there is no way to separate general effects from artifacts of that particular data. The abstract gives no sign of multi-attack comparisons or sensitivity checks.

This paper is for applied researchers and engineers working on network security detectors who want empirical pointers from a real deployment attempt. It is not field-changing, but the experiments are grounded enough to deserve referee time so the details can be checked and the scope limitation can be discussed openly. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical case study on supervised machine learning for botnet detection from network traffic data and derives three practical guidelines for ML-based malicious network traffic classification: feature representations should incorporate attack characteristics, ensemble models are suitable for class imbalance, and ground-truth granularity affects method success.

Significance. If the case-study results hold under the reported conditions, the work supplies concrete, practitioner-oriented recommendations on feature design and model choice for imbalanced cybersecurity tasks. The explicit framing as a case study and the focus on real-world considerations such as labeling granularity are positive; however, the absence of cross-attack validation limits the strength of any broader claims.

major comments (2)

[Abstract, §1] Abstract and opening of §1: the manuscript positions its three findings as 'concrete guidelines and recommendations for using supervised ML in cyber security,' yet every experiment and result is confined to a single botnet-detection dataset; no sensitivity analysis or comparison across attack types (scanning, C&C, exfiltration) is provided, which is load-bearing for the generality of the stated guidelines.
[Experimental sections (results tables)] Experimental evaluation sections: the claim that 'ensemble models are well-suited to handle class imbalance' rests on performance numbers from the botnet data alone; without reporting the imbalance ratios, the precise ensemble variants, or ablation against non-ensemble baselines on the same splits, it is impossible to isolate the contribution of the ensemble choice from dataset-specific artifacts.

minor comments (2)

[§3] Notation for feature sets and ground-truth labels is introduced without a consolidated table; a single reference table would improve readability.
[§5 or conclusion] The paper does not state whether code or the exact dataset splits are released; adding this information would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below while preserving the case-study framing of the work.

Referee: [Abstract, §1] Abstract and opening of §1: the manuscript positions its three findings as 'concrete guidelines and recommendations for using supervised ML in cyber security,' yet every experiment and result is confined to a single botnet-detection dataset; no sensitivity analysis or comparison across attack types (scanning, C&C, exfiltration) is provided, which is load-bearing for the generality of the stated guidelines.

Authors: The manuscript explicitly introduces the work as a case study on botnet detection and derives the three observations from the empirical results obtained on that dataset. We do not claim the guidelines are proven for arbitrary attack types. We will revise the abstract and opening of §1 to state more clearly that the guidelines are derived from this specific case study and that validation on additional attack types would be required to assess wider applicability. revision: yes

Referee: [Experimental sections (results tables)] Experimental evaluation sections: the claim that 'ensemble models are well-suited to handle class imbalance' rests on performance numbers from the botnet data alone; without reporting the imbalance ratios, the precise ensemble variants, or ablation against non-ensemble baselines on the same splits, it is impossible to isolate the contribution of the ensemble choice from dataset-specific artifacts.

Authors: We will revise the experimental sections to state the class-imbalance ratios explicitly, name the precise ensemble variants used, and present the comparisons to non-ensemble baselines on identical splits in a dedicated table or paragraph so that the contribution of the ensemble choice can be isolated from dataset-specific effects. revision: yes
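The two reporting items promised here, an explicit class-imbalance ratio and comparisons on identical splits, can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the label encoding (1 = malicious, 0 = benign), the 96-to-4 toy data, the function names, and the use of a seeded, stratified random split, which is not claimed to be the split the authors use.

```python
import random


def imbalance_ratio(labels):
    """Return the benign-to-malicious ratio for labels where 1 means malicious."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("no malicious examples; ratio is undefined")
    return (len(labels) - positives) / positives


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices into train/test while preserving the class proportions.

    A fixed seed makes the split reproducible, so every model under
    comparison can be trained and evaluated on exactly the same indices.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_fraction))
        test.extend(idx[:cut])
        train.extend(idx[cut:])
    return sorted(train), sorted(test)


# Hypothetical labels: 96 benign and 4 malicious examples.
labels = [0] * 96 + [1] * 4
print("imbalance ratio:", imbalance_ratio(labels))

train_idx, test_idx = stratified_split(labels, seed=42)
print(len(train_idx), "train /", len(test_idx), "test")
```

Reusing the returned `train_idx` and `test_idx` for both the ensemble and the non-ensemble baselines keeps the comparison on identical splits; a time-based or scenario-based split could be substituted where random splitting would leak correlated records.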

Circularity Check

0 steps flagged

Empirical case study with no derivation chain or self-referential reductions

This is an empirical paper presenting experimental findings from a single botnet detection case study on network traffic data. The three highlighted results (attack-aware features, ensembles for imbalance, ground-truth granularity) are direct observations from model training and evaluation runs, not outputs of any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes appear. The paper explicitly frames its contributions as case-study guidelines rather than general derivations, so no load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical guidelines paper; no mathematical model, free parameters, axioms, or invented entities are described in the abstract.
