Do Transformers Actually Help Intrusion Detection? A Temporal Sequence Evaluation on CIC-IDS2017

arxiv: 2606.11098 · v1 · pith:IKI254QUnew · submitted 2026-06-09 · 💻 cs.CR · cs.LG

Zach Moczkodan (1), Hany Ragab (1) ((1) Royal Military College of Canada, Kingston, Canada)

Pith reviewed 2026-06-27 12:32 UTC · model grok-4.3

classification 💻 cs.CR cs.LG

keywords intrusion detection · transformers · temporal sequences · CIC-IDS2017 · padding effects · leakage-free evaluation · network flows · macro-F1

The pith

Padding convention determines whether Transformers outperform other models on temporal network intrusion detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reformulates CIC-IDS2017 as a sequence task using ordered flow sequences from network conversations and compares nine models across random splits, leakage-free splits, and different padding methods. It finds that the Transformer reaches the highest macro-F1 score of 0.89 only on non-padded windows; zero-padding plus masking causes a 0.24 drop in its score while LSTM, GRU, and 1D-CNN scores stay stable. Under leakage-free group splits the Random Forest shows the smallest performance change and the Transformer false-alarm rate rises from 0.04 percent to 2.7 percent. The authors conclude that split protocol and padding choice affect reported performance more than architectural differences and recommend leakage-free splits plus explicit padding disclosure as standard practice.

Core claim

When CIC-IDS2017 is turned into ordered flow sequences, the Transformer records the highest macro-F1 of any tested model (0.89) on genuinely sequential, non-padded windows. The same model loses 0.24 macro-F1 under zero-pad-plus-mask evaluation, while the LSTM, GRU, and 1D-CNN baselines remain stable. Under leakage-free group splits the Random Forest is the most robust model, and the Transformer's false-alarm rate rises from 0.04% to 2.7%, a 67-fold increase.

What carries the argument

Padding convention (non-padded windows versus zero-pad-plus-mask) together with train-test split protocol (random versus leakage-free group splits) applied to ordered flow sequences.
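The padding conventions named here can be made concrete with a small sketch. It is illustrative only: the function names, the right-side padding direction, and the flat list-of-floats flow layout are assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Flow = List[float]  # one flow's feature vector (hypothetical layout)


def pad_zero_mask(window: List[Flow], T: int) -> Tuple[List[Flow], List[bool]]:
    """Right-pad with all-zero flows; mask is True for real flows, False for padding."""
    if not window:
        raise ValueError("window must contain at least one flow")
    window = window[-T:]  # keep at most the T most recent flows
    n_pad = T - len(window)
    zero_flow = [0.0] * len(window[0])
    padded = [list(f) for f in window] + [list(zero_flow) for _ in range(n_pad)]
    mask = [True] * len(window) + [False] * n_pad
    return padded, mask


def pad_repeat_last(window: List[Flow], T: int) -> List[Flow]:
    """Right-pad by repeating the last real flow; no mask is produced."""
    if not window:
        raise ValueError("window must contain at least one flow")
    window = window[-T:]
    n_pad = T - len(window)
    return [list(f) for f in window] + [list(window[-1]) for _ in range(n_pad)]


def is_non_padded(window: List[Flow], T: int) -> bool:
    """True when the window already holds T real flows and needs no padding."""
    return len(window) >= T


short = [[1.0, 2.0], [3.0, 4.0]]
padded, mask = pad_zero_mask(short, T=4)
print(padded)                       # [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
print(mask)                         # [True, True, False, False]
print(pad_repeat_last(short, T=4))  # [[1.0, 2.0], [3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
print(is_non_padded(short, T=4))    # False
```

With repeat-last padding the filler rows look like real traffic, whereas zero-pad-plus-mask marks them explicitly; a model that consumes the mask therefore sees a different input distribution under each convention.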

If this is right

Transformers need non-padded sequence inputs to realize their reported advantage on this task.
Zero-padding plus masking produces results that do not reflect the model's behavior on actual sequential inputs.
Leakage-free group splits expose a large increase in Transformer false alarms that random splits conceal.
Random Forest remains the most stable model once padding and split artifacts are removed.
Reported near-perfect scores on CIC-IDS2017 can be inflated by up to 0.24 macro-F1 when padding and split choices favor one architecture.
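A leakage-free group split of the kind discussed above can be sketched in a few lines. This is a minimal illustration, assuming each sample carries the five-tuple of its conversation as a group key; `group_split`, its parameters, and the example addresses are hypothetical and not taken from the paper's code.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def group_split(
    groups: Sequence[Hashable], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Assign whole groups to train or test so that no group appears in both."""
    unique = sorted(set(groups), key=repr)  # deterministic order before shuffling
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical data: 50 samples spread over 5 conversations (five-tuples).
five_tuples = [("10.0.0.1", "10.0.0.9", 40000 + k % 5, 80, 6) for k in range(50)]
train_idx, test_idx = group_split(five_tuples, test_fraction=0.2, seed=0)

train_groups = {five_tuples[i] for i in train_idx}
test_groups = {five_tuples[i] for i in test_idx}
print(len(train_idx), len(test_idx))  # 40 10
print(train_groups & test_groups)     # set()
```

A random per-sample split would scatter windows from the same conversation across both sides, which is the leakage the group split is meant to prevent. scikit-learn's `GroupShuffleSplit` provides a comparable group-aware split for projects that already depend on it.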

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same padding sensitivity may appear in other security tasks that turn packet streams into fixed-length windows.
Future benchmarks could test whether the performance gap closes when all models receive identical non-padded variable-length inputs.
Researchers working with any recurrent or attention model on network data may need to publish padding code alongside accuracy numbers.

Load-bearing premise

That the ordered flow sequences built from CIC-IDS2017 conversations contain genuine temporal structure and no artificial patterns or label leakage.

What would settle it

Re-running the nine-model comparison on the same ordered sequences but with a different padding scheme that keeps the Transformer macro-F1 within 0.05 of its non-padded value while the other models stay unchanged.

Figures

Figures reproduced from arXiv: 2606.11098 by Zach Moczkodan (1), Hany Ragab (1) ((1) Royal Military College of Canada, Kingston, Canada).

**Figure 1.** Static versus temporal intrusion detection. Static models classify one network flow at a time, whereas temporal models exploit ordered flow sequences within a conversation before producing the IDS decision. Most existing approaches treat CIC-IDS2017 as a static tabular classification problem, where each network flow is independently classified. However, network traffic is inherently temporal, and ignoring …

**Figure 2.** Main contribution. We re-formulate CIC-IDS2017 as a real temporal sequence task by grouping flows on their five-tuple and constructing T=20 sliding windows, then benchmark nine architectures across four evaluation protocols.

TABLE I: Preprocessing: columns removed from the model feature set before training. Every dropped column is either an identifier, a timestamp, or the label itself. (Columns: Column, Type, Reason, …)

**Figure 3.** Per-class F1 under the random 80/20 split, mean across …

**Figure 4.** Sequence length ablation across all four temporal …

**Figure 5.** Per-sample inference latency vs. sequence length …

**Figure 6.** Cost vs. accuracy at T=20 (log-scale latency, mean macro-F1 over three seeds; RF latency measured on CPU). The 1D-CNN, GRU and LSTM form the Pareto frontier; the CNN–Transformer matches the LSTM’s macro-F1 at ∼13× the latency, and the Random Forest is likewise dominated by the LSTM.

TABLE VI: Static models on the full 15-class flow data with per-class cap of 25,000 (no temporal windowing), mean ± std over …

**Figure 7.** Realistic evaluation, mean ± std over three seeds. Left: macro-F1 under random vs. group split. Right: drop from the random split per model. The Random Forest and the lightweight RNN/CNN models hold within 0.03 of their random-split macro-F1; the CNN–Transformer drops 0.23, almost entirely attributable to the zero-pad+mask protocol rather than the split itself (cf. Table VII), a sensitivity that the rando…

**Figure 8.** Per-class recall under the random split (left) vs. the leakage-free group-by-five-tuple split (right), mean across three …

**Figure 9.** Per-class recall drop (random − group), mean across three seeds. Red cells = recall loss going to the leakage-free split; blue cells = gain. The Bot column drops for every temporal architecture (the Random Forest’s Bot recall actually improves); the two Transformer variants lose recall across more classes than the recurrent and convolutional models do.
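The Figure 2 caption describes grouping flows on their five-tuple and constructing T=20 sliding windows. A minimal sketch of that kind of construction follows. The record layout (`five_tuple`, `timestamp`, `features`, `label`), the stride of 1, and labelling each window by its last flow are all assumptions made for illustration; the paper's repository is the authority on the actual rules.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def build_windows(
    flows: Iterable[Dict[str, Any]], T: int = 20, stride: int = 1
) -> List[Dict[str, Any]]:
    """Group flows by five-tuple, order them in time, and emit full-length windows."""
    by_conversation = defaultdict(list)
    for flow in flows:
        by_conversation[flow["five_tuple"]].append(flow)

    windows = []
    for key, conv in by_conversation.items():
        # The timestamp is used only for ordering, not as a model feature.
        conv.sort(key=lambda f: f["timestamp"])
        # Conversations shorter than T yield no window, so nothing is padded here.
        for start in range(0, len(conv) - T + 1, stride):
            chunk = conv[start:start + T]
            windows.append({
                "five_tuple": key,
                "features": [f["features"] for f in chunk],
                "label": chunk[-1]["label"],  # assumption: label of the last flow
            })
    return windows


# Hypothetical toy data: conversation "A" has 5 flows, conversation "B" has 2.
flows = [
    {"five_tuple": "A", "timestamp": t, "features": [float(t)], "label": "BENIGN"}
    for t in (4, 2, 0, 1, 3)
] + [
    {"five_tuple": "B", "timestamp": t, "features": [9.0], "label": "DoS"}
    for t in (0, 1)
]
windows = build_windows(flows, T=3)
print(len(windows))            # 3
print(windows[0]["features"])  # [[0.0], [1.0], [2.0]]
```

Keeping the five-tuple on every window is what later allows a group-aware split, since all windows from one conversation can be routed to the same side.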

Original abstract

Recent deep learning approaches for network intrusion detection increasingly incorporate temporal architectures such as recurrent networks and Transformers, often reporting near-perfect performance on CIC-IDS2017. However, many existing studies neither supply their temporal modules with genuine sequence inputs nor evaluate under realistic, leakage-free conditions, making it unclear whether reported gains arise from true sequence-modeling capability. In this work, we reformulate CIC-IDS2017 as a temporal intrusion-detection task by constructing ordered flow sequences from network conversations and benchmarking nine classical and deep learning architectures under a random split, two leakage-free splits, and a padding-scheme ablation. The central finding is that padding convention, not architecture, determines the Transformer's performance: on genuinely sequential (non-padded) windows the Transformer achieves the highest macro-F1 of any model in the experiment (0.89); under zero-pad+mask evaluation it drops markedly (-0.24 macro-F1), while LSTM, GRU, and 1D-CNN remain stable. Under leakage-free group evaluation the Random Forest is the most robust model (+0.009), while the Transformer's false-alarm rate grows from 0.04% to 2.7%, a 67-fold increase invisible under conventional protocols. These findings demonstrate that evaluation methodology -- specifically padding convention and split protocol -- has a larger effect on reported performance than architectural choice, and that widely used random splits with repeat-last padding can overestimate model robustness by up to 0.24 macro-F1. We advocate leakage-free splits, explicit padding disclosure, and sequence-aware benchmarking as standard practice in future IDS research. Code and implementation details are available at https://github.com/zachmocz/temporal-ids-bench.
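The two metrics the abstract relies on, macro-F1 and false-alarm rate, can be computed as in the sketch below. It is illustrative only: the false-alarm definition used here (benign samples flagged as any attack class, divided by all benign samples) and the `"BENIGN"` label string are assumptions, since the abstract does not define them.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred), key=repr)
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def false_alarm_rate(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], benign: Hashable = "BENIGN"
) -> float:
    """Share of benign samples predicted as some attack class (assumed definition)."""
    benign_total = sum(t == benign for t in y_true)
    false_alarms = sum(t == benign and p != benign for t, p in zip(y_true, y_pred))
    return false_alarms / benign_total if benign_total else 0.0


y_true = ["BENIGN", "BENIGN", "DoS", "DoS"]
y_pred = ["BENIGN", "DoS", "DoS", "DoS"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
print(false_alarm_rate(y_true, y_pred))    # 0.5

# The reported rise from 0.04% to 2.7% corresponds to a ratio of about 67.5,
# which the abstract states as a 67-fold increase.
print(round(2.7 / 0.04, 1))                # 67.5
```

Because macro-F1 weights every class equally, a collapse on a rare class moves the score far more than it would move overall accuracy, which is why the padding effect shows up so strongly in this metric.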

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Padding scheme and leakage-free splits swing Transformer scores far more than architecture does on this IDS task.

The main point is that padding convention alone can drop a Transformer's macro-F1 by 0.24 on CIC-IDS2017 while the LSTM, GRU, and 1D-CNN stay flat, and that leakage-free group splits raise its false-alarm rate 67-fold, from 0.04% to 2.7%. Random splits with repeat-last padding hide both effects.

The paper builds ordered flow sequences from the dataset, runs nine models across random, group, and time-based splits, and adds an explicit padding ablation. That controlled comparison is new; earlier work rarely isolates padding this cleanly or reports the false-alarm jump under realistic splits. Releasing the code is useful and lets others check the windowing and labeling steps.

The central claim holds up on the numbers given. The weakest part is that sequence construction details are only sketched in the abstract, though the repo should resolve that. No circularity or invented metrics appear.

This is for anyone running temporal models on network data who wants to avoid over-optimistic numbers. It deserves referee time because the empirical gaps are large, the ablations are direct, and the takeaway challenges a common practice with reproducible evidence.

Referee Report

0 major / 2 minor

Summary. The paper reformulates CIC-IDS2017 as a temporal task by constructing ordered flow sequences from network conversations and benchmarks nine models (including Transformer, LSTM, GRU, 1D-CNN, and Random Forest) under a random split, two leakage-free splits, and a padding-scheme ablation. The central claim is that padding convention, not architecture, drives Transformer performance: on non-padded sequential windows the Transformer reaches the highest macro-F1 (0.89), while zero-pad+mask evaluation lowers it by 0.24 macro-F1 and the LSTM, GRU, and 1D-CNN remain stable. Under leakage-free group evaluation the Random Forest is the most robust model (+0.009), while the Transformer’s false-alarm rate rises from 0.04% to 2.7%. The work concludes that evaluation methodology has a larger effect on reported performance than architectural choice and advocates leakage-free splits plus explicit padding disclosure.

Significance. If the empirical results hold, the paper demonstrates that widely used random splits with repeat-last padding can overestimate robustness by up to 0.24 macro-F1 and that many prior Transformer claims in IDS may be artifacts of evaluation choices rather than genuine sequence-modeling gains. The explicit ablations, leakage tests via group splits, and public code release constitute a reproducible benchmark that directly supports falsifiable claims about padding sensitivity and split leakage, strengthening the case for revised standards in temporal IDS evaluation.

minor comments (2)

The abstract states that sequences are 'constructed from network conversations' but does not specify the exact windowing parameters or flow aggregation rules; a brief methods paragraph or table listing these choices would aid replication even though the GitHub link is provided.
Figure or table captions could explicitly label the padding condition (non-padded vs. zero-pad+mask) alongside each macro-F1 column to make the ablation comparison immediately visible without cross-referencing text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review, accurate summary of our contributions, and recommendation to accept the manuscript. We appreciate the recognition that our ablations on padding and leakage-free splits provide a reproducible benchmark for temporal IDS evaluation.

Circularity Check

0 steps flagged

No significant circularity identified

The paper is an empirical benchmarking study that constructs ordered flow sequences from CIC-IDS2017, evaluates nine models under random and leakage-free splits, and performs an explicit padding-scheme ablation. No equations, derivations, or predictions are present that reduce by construction to fitted inputs or self-citations. Central claims rest on reported macro-F1, false-alarm rates, and ablation deltas from direct experiments on a public dataset, with code released. Sequence construction is presented as a standard reformulation rather than a derived result, and leakage testing is performed explicitly via group splits. No steps match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that CIC-IDS2017 flows can be meaningfully ordered into temporal sequences and that standard ML training procedures apply; no new free parameters or invented entities are introduced beyond standard model hyperparameters.

axioms (1)

domain assumption: The CIC-IDS2017 dataset provides accurate flow labels and timestamps that permit construction of genuine temporal sequences without label leakage.
Invoked when reformulating the dataset as a temporal intrusion-detection task.

pith-pipeline@v0.9.1-grok · 5860 in / 1338 out tokens · 28771 ms · 2026-06-27T12:32:11.361329+00:00 · methodology
