Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation
Pith reviewed 2026-06-26 23:54 UTC · model grok-4.3
The pith
A new multi-source log dataset with per-entry ATT&CK labels lets fine-tuned small language models classify attack chunks at 90-97 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
No existing public dataset supplies simultaneous system, network, and browser logs with per-entry ATT&CK technique labels. The introduced collection of 870 sessions and approximately 2.3 million events, captured on Windows endpoints with attacks generated by real tools and labeled across 53 ATT&CK techniques, enables fine-tuned small language models to classify log chunks at 90-97 percent accuracy and to identify techniques at up to 42 percent exact match.
What carries the argument
The ATT&CK-labeled multi-source log dataset, which supplies the training examples needed for models to correlate events across system, network, and browser sources.
If this is right
- Fine-tuned models can detect multi-stage attacks by learning patterns that span system, network, and browser logs simultaneously.
- The dataset supports training for both broad chunk classification and granular ATT&CK technique identification.
- Performance gains from LoRA fine-tuning hold across three different small language model architectures.
- High partial-match scores indicate models capture underlying attack reasoning even when exact technique labels are missed.
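For readers unfamiliar with LoRA, the following is a minimal sketch of the low-rank update it trains, written in plain Python. The matrix sizes, rank, and scaling value are illustrative assumptions; the paper's actual hyperparameters are not reported in this summary.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the weight a LoRA-adapted layer applies."""
    delta = matmul(b, a)  # (d x r) @ (r x k) -> d x k
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Hypothetical sizes: one 64 x 64 frozen weight adapted at rank 4.
d, k, r, alpha = 64, 64, 4, 8
rng = random.Random(0)
w = [[rng.gauss(0, 0.02) for _ in range(k)] for _ in range(d)]  # frozen base weight
a = [[rng.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]  # trainable, r x k
b = [[0.0] * r for _ in range(d)]                               # trainable, d x r, zero init

w_eff = lora_effective_weight(w, a, b, alpha, r)

full_params = d * k        # parameters updated by full fine-tuning of this matrix
lora_params = r * (d + k)  # parameters updated by LoRA for the same matrix
print(full_params, lora_params)
```

Because B starts at zero, the adapted layer initially behaves exactly like the base layer, and only the two small matrices receive gradient updates during fine-tuning.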
Where Pith is reading between the lines
- Security teams could incorporate the dataset into monitoring pipelines to improve detection of attacks that involve browser activity.
- The exact-match limit of 42 percent on technique identification points to a need for additional data or model refinements to reach production reliability.
- Extending the dataset with more attack variants or automated labeling could scale its use for broader model training.
Load-bearing premise
The 70 author-generated attack sessions using real tools represent the distribution and labeling quality of actual multi-stage cyberattacks in the wild.
What would settle it
An independent collection of real-world multi-source logs from actual incidents, labeled by experts, on which models trained solely on this dataset show low accuracy in chunk classification or technique identification.
Original abstract
Multi-stage cyberattacks span system, network, and browser logs. Detecting them requires correlating events across all three sources. Machine learning methods can learn these cross-source patterns, but they need labeled multi-source data. Existing public datasets fall short. Network-only datasets such as CICIDS and UNSW-NB15 miss host and browser activity. Host-focused datasets such as LMDG and CICAPT-IIoT lack browser telemetry. ATLAS includes all three sources but labels events only as malicious or benign, without MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) technique granularity. No public dataset combines all three sources with per-entry ATT&CK technique labels. We close the gap by building a multi-source log dataset of 870 sessions (70 attack, 800 benign) and approximately 2.3 million events. We captured system, network, and browser activity simultaneously on Windows endpoints. We labeled malicious events with ATT&CK technique IDs, covering 12 tactics and 53 techniques. We generated all attack data using real tools, including Remote Access Trojan (RAT), Command and Control (C2) tunnels, and cloud exfiltration. To demonstrate learnability, we fine-tuned three Small Language Models (SLMs) (Qwen2.5-1.5B, Llama-3.2-3B, Phi-4-Mini) using Low-Rank Adaptation (LoRA). We compared each against its base variant across ten metrics on two tasks: chunk classification and ATT&CK technique identification. Fine-tuning improved every model on every metric. Chunk classification accuracy rose from approximately 8% in the base variants to between 90% and 97% after fine-tuning. Technique identification remained challenging, with the best exact-match accuracy at 42%, although high partial-match scores show the models captured most of the underlying reasoning.
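The abstract contrasts exact-match accuracy with partial-match scores but this summary does not state the partial-match rule. The sketch below assumes one common choice, set-based Jaccard overlap between predicted and gold technique IDs. The function names and the three example predictions are hypothetical and are not drawn from the dataset.

```python
def exact_match(predicted, gold):
    """True only when the predicted technique set equals the gold set."""
    return set(predicted) == set(gold)

def jaccard(predicted, gold):
    """Partial credit: size of the overlap divided by size of the union."""
    p, g = set(predicted), set(gold)
    if not p and not g:
        return 1.0  # both empty counts as full agreement
    return len(p & g) / len(p | g)

# Hypothetical (predicted, gold) pairs using ATT&CK technique IDs.
examples = [
    (["T1059", "T1071"], ["T1059", "T1071"]),  # fully correct
    (["T1059", "T1041"], ["T1059", "T1567"]),  # one of two techniques right
    (["T1071"], ["T1059", "T1071"]),           # one technique missed
]

exact_rate = sum(exact_match(p, g) for p, g in examples) / len(examples)
partial_mean = sum(jaccard(p, g) for p, g in examples) / len(examples)
print(f"exact={exact_rate:.2f} partial={partial_mean:.2f}")
```

Under this assumed rule, a model can score well on partial match while exact match stays low, which is the pattern the abstract describes.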
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to address the lack of public multi-source cybersecurity log datasets with per-entry ATT&CK technique labels by constructing a dataset of 870 sessions (70 attack, 800 benign) comprising approximately 2.3 million events from system, network, and browser logs on Windows endpoints. Attacks are generated with real tools, including a RAT, C2 tunnels, and cloud exfiltration, and malicious events are labeled with ATT&CK technique IDs covering 12 tactics and 53 techniques. The authors fine-tune three SLMs with LoRA and report improvements in chunk classification accuracy (from approximately 8% in the base variants to 90-97%) and in technique identification (up to 42% exact match).
Significance. If the labels prove reliable, this dataset would fill a documented gap left by network-only (CICIDS, UNSW-NB15), host-only (LMDG), and coarsely labeled (ATLAS) resources, enabling cross-source correlation at technique granularity. The consistent gains across three SLMs after LoRA fine-tuning provide concrete evidence that the collected data supports supervised learning, which is a strength of the empirical component.
major comments (3)
- [Data generation and labeling] Data generation and labeling section: No inter-annotator agreement, label-validation procedure, or external review of the ATT&CK assignments is described. Because the 70 attack sessions were both generated and labeled internally, the absence of these metrics directly affects the trustworthiness of the per-event technique labels that underpin both the dataset contribution and the fine-tuning results.
- [Experimental evaluation] Experimental evaluation section: The manuscript provides no information on the train/test split of sessions or events, the treatment of class imbalance (800 benign vs. 70 attack), or coverage statistics across the 53 techniques. These details are load-bearing for interpreting whether the reported jumps (chunk accuracy 8 % to 90–97 %, exact-match 42 %) reflect genuine cross-source pattern learning rather than overfitting to the authors’ synthetic distribution.
- [Results] Results section: The 42 % exact-match figure for technique identification is presented without an ablation on label granularity or a comparison against a non-LLM baseline; combined with the lack of diversity analysis versus real incident reports, this weakens the claim that the fine-tuned models have captured generalizable ATT&CK reasoning.
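The imbalance concern in the second major comment can be made concrete with arithmetic on the session counts given in the abstract. The sketch below assumes, purely for illustration, a binary benign-versus-attack task scored per session. The paper scores log chunks, and its chunk-level class distribution is not available in this summary, so the numbers show the shape of the problem rather than the paper's actual baseline.

```python
BENIGN_SESSIONS = 800
ATTACK_SESSIONS = 70

def majority_baseline_accuracy(benign, attack):
    """Accuracy of a classifier that always predicts the larger class."""
    return max(benign, attack) / (benign + attack)

def balanced_accuracy(tp, fn, tn, fp):
    """Mean of per-class recall, which is insensitive to class sizes."""
    recall_attack = tp / (tp + fn)
    recall_benign = tn / (tn + fp)
    return (recall_attack + recall_benign) / 2

# An always-benign predictor under the assumed per-session binary setup.
acc = majority_baseline_accuracy(BENIGN_SESSIONS, ATTACK_SESSIONS)
bal = balanced_accuracy(tp=0, fn=ATTACK_SESSIONS, tn=BENIGN_SESSIONS, fp=0)
print(f"accuracy={acc:.3f} balanced_accuracy={bal:.3f}")
```

In this hypothetical setup an always-benign predictor reaches about 92 percent plain accuracy while its balanced accuracy is 50 percent, which is why the split, the chunk-level class distribution, and an imbalance-aware metric are needed to interpret the reported figures.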
minor comments (2)
- [Abstract] Abstract: the phrase 'high partial-match scores' should be accompanied by the precise definition or scoring rule used for partial matches so readers can assess what the models actually learned.
- [Introduction] Introduction: verify that every cited dataset (CICIDS, UNSW-NB15, LMDG, CICAPT-IIoT, ATLAS) appears in the reference list with complete bibliographic details.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating the revisions we will incorporate to improve the manuscript.
Point-by-point responses
- Referee: [Data generation and labeling] Data generation and labeling section: No inter-annotator agreement, label-validation procedure, or external review of the ATT&CK assignments is described. Because the 70 attack sessions were both generated and labeled internally, the absence of these metrics directly affects the trustworthiness of the per-event technique labels that underpin both the dataset contribution and the fine-tuning results.
Authors: We acknowledge that the manuscript does not describe inter-annotator agreement, formal validation procedures, or external review. Labeling was performed internally by the authors by mapping observed tool behaviors (RAT, C2, exfiltration) to ATT&CK technique definitions from the official matrix. To address the concern, we will expand the Data generation and labeling section with a detailed description of the mapping process, including examples of how specific events were assigned technique IDs and any consistency checks performed among co-authors. We cannot retroactively add IAA metrics that were not collected, but the public release of the dataset will enable external validation. revision: partial
- Referee: [Experimental evaluation] Experimental evaluation section: The manuscript provides no information on the train/test split of sessions or events, the treatment of class imbalance (800 benign vs. 70 attack), or coverage statistics across the 53 techniques. These details are load-bearing for interpreting whether the reported jumps (chunk accuracy 8 % to 90–97 %, exact-match 42 %) reflect genuine cross-source pattern learning rather than overfitting to the authors’ synthetic distribution.
Authors: The referee correctly identifies missing details. We will add a dedicated subsection to Experimental evaluation that specifies: the session-based train/validation/test split (preventing event leakage across sources), the method used to handle imbalance during LoRA fine-tuning (e.g., class-weighted loss), and per-technique coverage counts showing how many of the 53 techniques appear in the 70 attack sessions. These additions will allow readers to evaluate whether the accuracy gains reflect cross-source learning. revision: yes
- Referee: [Results] Results section: The 42 % exact-match figure for technique identification is presented without an ablation on label granularity or a comparison against a non-LLM baseline; combined with the lack of diversity analysis versus real incident reports, this weakens the claim that the fine-tuned models have captured generalizable ATT&CK reasoning.
Authors: We agree that additional analyses would strengthen the results section. We will add (1) an ablation comparing exact-match performance at technique versus tactic granularity and (2) a non-LLM baseline (e.g., a feature-based classifier on aggregated log statistics). A quantitative diversity comparison against real incident reports is not feasible without equivalently labeled multi-source real-world data, which does not currently exist at this granularity; we will instead add a qualitative discussion mapping our generated attacks to techniques frequently cited in public threat reports. revision: partial
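The session-based split and per-technique coverage counts promised in the responses above could look roughly like the following sketch. The field names (`session_id`, `source`, `technique`), the split ratios, and the sample events are all hypothetical; the released dataset's schema may differ, and the sketch does not stratify by class.

```python
import random
from collections import Counter

def split_sessions(session_ids, train=0.7, val=0.15, seed=0):
    """Assign whole sessions to train/val/test so no session spans two splits."""
    ids = sorted(set(session_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))

def technique_coverage(events):
    """Count how many distinct sessions contain each labeled technique."""
    seen = {(e["session_id"], e["technique"]) for e in events if e.get("technique")}
    return Counter(tech for _, tech in seen)

# Hypothetical events; benign entries carry no technique label.
events = [
    {"session_id": "s1", "source": "system", "technique": "T1059"},
    {"session_id": "s1", "source": "network", "technique": "T1071"},
    {"session_id": "s1", "source": "network", "technique": "T1071"},
    {"session_id": "s2", "source": "browser", "technique": None},
    {"session_id": "s3", "source": "network", "technique": "T1071"},
    {"session_id": "s4", "source": "system", "technique": None},
]

train_ids, val_ids, test_ids = split_sessions(e["session_id"] for e in events)
train_events = [e for e in events if e["session_id"] in train_ids]
test_events = [e for e in events if e["session_id"] in test_ids]
coverage = technique_coverage(events)
print(dict(coverage))
```

Splitting on session IDs rather than on individual events keeps the system, network, and browser records of one session together, so a model is never tested on events from a session it saw during training.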
Circularity Check
Empirical dataset construction and SLM evaluation with no circular derivations
Full rationale
The paper describes capturing multi-source logs, author labeling of 70 attack sessions with ATT&CK IDs, and empirical fine-tuning/evaluation of three SLMs using LoRA, reporting direct accuracy metrics. No equations, fitted parameters renamed as predictions, self-citations load-bearing on uniqueness or ansatzes, or derivations that reduce to author-defined inputs by construction appear in the text. The central claims rest on observable data collection and measured performance deltas, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MITRE ATT&CK provides a stable, externally maintained taxonomy suitable for per-event labeling of malicious activity.
Reference graph
Works this paper leans on
- [1] E. M. Hutchins, M. J. Cloppert, and R. M. Amin, “Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains,” Leading Issues in Information Warfare & Security Research, vol. 1, no. 1, p. 80, 2011.
- [2] B. E. Strom et al., “MITRE ATT&CK: Design and philosophy,” The MITRE Corporation, Tech. Rep. MP180360R1, 2020.
- [3] CISA, FBI, MS-ISAC, and HHS, “#StopRansomware: RansomHub Ransomware,” CISA, Tech. Rep. AA24-242A, 2024. [Online]. Available: https://www.cisa.gov/news-events/cybersecurity-advisories/aa24-242a
- [4] CrowdStrike, “2026 global threat report,” CrowdStrike, Tech. Rep., 2026. [Online]. Available: https://www.crowdstrike.com/en-us/global-threat-report/
- [5] M. Tavallaee et al., “A detailed analysis of the KDD CUP 99 data set,” in Proc. IEEE Symp. Computational Intelligence for Security and Defense Applications (CISDA), 2009, pp. 1–6.
- [6] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward generating a new intrusion detection dataset and intrusion traffic characterization,” in Proc. 4th Int. Conf. Information Systems Security and Privacy (ICISSP), 2018, pp. 108–116.
- [7] N. Moustafa and J. Slay, “UNSW-NB15: A comprehensive data set for network intrusion detection systems,” in Proc. Military Communications and Information Systems Conf. (MilCIS), 2015, pp. 1–6.
- [8] S. García et al., “An empirical comparison of botnet detection methods,” Computers & Security, vol. 45, pp. 100–123, 2014.
- [9] S. Myneni et al., “DAPT 2020 – constructing a benchmark dataset for advanced persistent threats,” in Deployable Machine Learning for Security Defense. Springer, 2020, pp. 138–163.
- [10] S. Myneni et al., “Unraveled – a semi-synthetic dataset for advanced persistent threats,” Computer Networks, vol. 227, p. 109688, 2023.
- [11] E. Ghiasvand et al., “CICAPT-IIoT: A provenance-based APT attack dataset for IIoT environment,” arXiv:2407.11278, 2024.
- [12] A. Mabrouk, M. Hatem, M. Mamun, and S. Saad, “LMDG: Advancing lateral movement detection through high-fidelity dataset generation,” arXiv:2508.02942, 2025.
- [13] A. Alsaheel et al., “ATLAS: A sequence-based learning approach for attack investigation,” in Proc. 30th USENIX Security Symp., 2021, pp. 3005–3022. [Online]. Available: https://www.usenix.org/conference/usenixsecurity21/presentation/alsaheel
- [14] A. Riddle, K. Westfall, and A. Bates, “ATLASv2: ATLAS attack engagements, version 2,” arXiv:2401.01341, 2024.
- [15] S. S. Bagui et al., “Introducing UWF-ZeekData22: A comprehensive network traffic dataset based on the MITRE ATT&CK framework,” Data, vol. 8, no. 1, p. 18, 2023.
- [16] F. Wang et al., “A comprehensive survey of small language models in the era of large language models,” arXiv:2411.03350, 2024.
- [17] E. J. Hu et al., “LoRA: Low-rank adaptation of large language models,” in Int. Conf. Learning Representations (ICLR), 2022.
- [18] DARPA, “Transparent computing engagement data release,” https://github.com/darpa-i2o/Transparent-Computing, 2020.
- [19] M. M. Anjum, S. Iqbal, and B. Hamelin, “Analyzing the usefulness of the DARPA OpTC dataset in cyber threat detection research,” in Proc. 26th ACM Symp. Access Control Models and Technologies (SACMAT), 2021, pp. 27–32.
- [20] M. J. M. Turcotte, A. D. Kent, and C. Hash, “Unified host and network data set,” in Data Science for Cyber-Security. World Scientific, 2018, pp. 1–22.
- [21] Y. Chen et al., “A survey of large language models for cyber threat detection,” Computers & Security, vol. 145, p. 104016, 2024.
- [22] X. Han, S. Yuan, and M. Trabelsi, “LogGPT: Log anomaly detection via GPT,” in Proc. IEEE Int. Conf. Big Data (BigData), 2023, pp. 1117–1122.
- [23] S. Ullah et al., “LLMs cannot reliably identify and reason about security vulnerabilities (yet?),” in Proc. IEEE Symp. Security and Privacy (SP), 2024, pp. 862–880.
- [24] M. B. U. Ahmed et al., “SecVulEval: Benchmarking LLMs for real-world C/C++ vulnerability detection,” arXiv:2505.19828, 2025.
- [25] T. Dettmers et al., “QLoRA: Efficient finetuning of quantized LLMs,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [26] G. Husari et al., “TTPDrill: Automatic and accurate extraction of threat actions from unstructured text of CTI sources,” in Proc. 33rd Annual Computer Security Applications Conf. (ACSAC), 2017, pp. 103–115.
- [27] V. Legoy et al., “Automated retrieval of ATT&CK tactics and techniques for cyber threat reports,” arXiv:2004.14322, 2020.
- [28] Z. Li, J. Zeng, Y. Chen, and Z. Liang, “AttacKG: Constructing technique knowledge graph from cyber threat intelligence reports,” in Computer Security – ESORICS 2022, vol. 13554. Springer, 2022, pp. 589–609.
- [29] M. T. Alam et al., “Looking beyond IoCs: Automatically extracting attack patterns from external CTI,” in Proc. 26th Int. Symp. Research in Attacks, Intrusions and Defenses (RAID), 2023.
- [30] N. Rani et al., “TTPXHunter: Actionable threat intelligence extraction as TTPs from finished cyber threat reports,” ACM Digital Threats: Research and Practice, 2024.