Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports
Pith reviewed 2026-06-26 23:59 UTC · model grok-4.3
The pith
Open-source LLMs reach only 0.22 micro F1 on multi-label ATT&CK classification of real CTI reports.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a ground-truth dataset of 2,076 human-annotated sentences from 83 complex unstructured CTI reports mapped to 114 unique ATT&CK techniques, evaluation of seven open-source LLMs shows a highest micro-averaged F1 score of 0.22, establishing an empirical baseline and indicating that current open-source LLMs are insufficient for production-grade multi-label ATT&CK classification on complex CTI.
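For readers unfamiliar with the metric, the following is a minimal sketch of micro-averaged F1 for multi-label predictions, assuming each sentence carries a set of technique IDs and that true positives, false positives, and false negatives are pooled over all sentences before the score is computed. The function name `micro_f1` and the toy label sets are illustrative assumptions; they are not taken from the paper or its dataset.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over every sentence, then score once."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of sentences")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # techniques predicted and annotated
        fp += len(p - g)   # techniques predicted but not annotated
        fn += len(g - p)   # techniques annotated but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: three sentences, one of them technique-negative.
gold = [{"T1059", "T1105"}, set(), {"T1566"}]
pred = [{"T1059"}, {"T1027"}, {"T1566", "T1204"}]
score = micro_f1(gold, pred)  # tp=2, fp=2, fn=1
```

Because the counts are pooled, a spurious prediction on a technique-negative sentence lowers the score just as a wrong label on a positive sentence does, which matters for a dataset where 795 of 2,076 sentences are negative.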
What carries the argument
A six-phase human-annotated dataset of 2,076 sentences from real CTI reports, each carrying multi-label ATT&CK technique assignments, used as the test bed for LLM performance across parameter sizes, prompts, and temperatures.
If this is right
- Parameter count correlates positively with F1 score across the tested models.
- Neither prompt strategy nor temperature setting yields statistically significant gains.
- The 0.22 micro F1 result marks the performance level current open-source LLMs achieve on realistic multi-label ATT&CK tasks.
- The released dataset and benchmark supply a fixed reference point for measuring future progress.
Where Pith is reading between the lines
- Improving annotation agreement beyond kappa 0.68 could raise the measurable ceiling for any automated system.
- Hybrid approaches that combine LLMs with rule-based or retrieval components might exceed the reported baseline without requiring larger models.
- The gap between this result and production needs points to a requirement for domain-specific fine-tuning or architectural changes tailored to attack-pattern language.
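As a point of reference for the kappa figure, here is a minimal sketch of Cohen's kappa for two annotators, assuming one nominal label per item. This summary does not say how agreement was computed over multi-label assignments, so the single-label setup, the function name `cohens_kappa`, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement: chance overlap given each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for six sentences ("none" = technique-negative).
annotator_1 = ["T1059", "none", "T1566", "none", "T1059", "none"]
annotator_2 = ["T1059", "none", "T1204", "none", "none", "none"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on four of six items, yet kappa is well below 4/6 because much of that agreement is expected by chance from the frequent "none" label.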
Load-bearing premise
The six-phase annotation process on 83 reports produces labels accurate and representative enough of real-world CTI complexity to serve as ground truth.
What would settle it
An independent run on a fresh collection of complex CTI reports, using the same annotation protocol, in which any open-source LLM exceeds 0.30 micro F1 would falsify the claim that 0.22 represents the current ceiling.
Original abstract
Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but historically required extensive human effort. Pre-Large Language Model (LLM) automation sped up this process, but could not resolve the complex language and multi-step attack patterns found in unstructured CTI reports. LLMs addressed previous limitations by using contextual reasoning to understand unstructured text. However, current evaluations rely on simplified, single-technique sentences that ignore the complexity of real-world CTI reports, which often leads to inflated performance results. Consequently, the baseline performance of open-source LLMs on complex unstructured CTI reports remains unevaluated. To address this gap, we constructed a ground-truth dataset of 2,076 human-annotated sentences (1,281 technique-positive, 795 negative) from 83 complex unstructured CTI reports. These sentences were mapped to 114 unique ATT&CK techniques using a six-phase annotation process, achieving κ = 0.68 inter-annotator agreement. Using this dataset, we evaluated seven open-source LLMs ranging from 8B to 236B parameters across prompt strategy and temperature configurations. The highest-performing LLM achieved a micro-averaged F1 score of 0.22, establishing the empirical baseline for multi-label ATT&CK classification on complex unstructured CTI. Parameter size showed a statistically significant positive correlation with F1 score. Prompt strategy and temperature produced no statistically significant gains across model configurations. These results indicate that current open-source LLMs are insufficient for production-grade ATT&CK classification. The dataset, benchmark, and findings provide a reproducible foundation for future CTI research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper constructs a ground-truth dataset of 2,076 human-annotated sentences (1,281 technique-positive) drawn from 83 complex unstructured CTI reports. These are mapped to 114 ATT&CK techniques via a six-phase annotation process yielding kappa = 0.68. Seven open-source LLMs (8B–236B parameters) are evaluated on multi-label classification under varied prompt strategies and temperatures; the best model reaches micro-averaged F1 = 0.22. The authors conclude that current open-source LLMs remain insufficient for production-grade ATT&CK classification on real CTI, while reporting a statistically significant positive correlation between parameter count and F1 and no significant effect from prompt or temperature.
Significance. If the annotations constitute reliable ground truth, the work supplies the first empirical baseline for LLM performance on realistic multi-label ATT&CK classification from complex CTI reports. Prior evaluations on simplified single-technique sentences likely overstated capabilities; this dataset therefore supplies a reproducible foundation for measuring progress toward production use. The size–performance correlation and null results for prompting/temperature are also actionable for the community.
major comments (1)
- [Dataset Construction / Annotation section] The central claim that micro-F1 = 0.22 demonstrates that open-source LLMs are insufficient for production-grade classification rests on the 2,076 human labels being a low-noise proxy for the true techniques. With kappa = 0.68 on a 114-class multi-label task, even modest per-sentence disagreement can inject label noise sufficient to cap observable F1 near the reported value. The manuscript should either (a) quantify the effective upper bound on F1 given the observed agreement (e.g., via label-flip simulation on the resolved annotations) or (b) qualify the interpretation to acknowledge that part of the performance gap may be attributable to annotation ambiguity rather than model limitations.
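One way the label-flip simulation in option (a) could be approached is sketched below. It perturbs the resolved labels with random noise and scores the perturbed copy against the originals, so the mean score approximates what a system would reach if its only errors were annotation-style disagreements. The noise model (independent label drops plus at most one spurious label per sentence), the rates `p_drop` and `p_add`, and all function names are assumptions made for illustration. Nothing here is calibrated to kappa = 0.68 or drawn from the paper.

```python
import random
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 with TP/FP/FN pooled over all sentences."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def perturb(labels: List[Set[str]], vocab: List[str],
            p_drop: float, p_add: float, rng: random.Random) -> List[Set[str]]:
    """Return a noisy copy: drop each label w.p. p_drop, add one random label w.p. p_add."""
    noisy = []
    for sentence_labels in labels:
        # sorted() keeps iteration order stable so a fixed seed is reproducible.
        kept = {t for t in sorted(sentence_labels) if rng.random() >= p_drop}
        if rng.random() < p_add:
            kept.add(rng.choice(vocab))
        noisy.append(kept)
    return noisy


def simulated_ceiling(labels: List[Set[str]], vocab: List[str],
                      p_drop: float, p_add: float,
                      trials: int = 200, seed: int = 0) -> float:
    """Mean micro-F1 of noisy label copies scored against the resolved labels."""
    rng = random.Random(seed)
    scores = [micro_f1(labels, perturb(labels, vocab, p_drop, p_add, rng))
              for _ in range(trials)]
    return sum(scores) / len(scores)


# Hypothetical resolved annotations and technique vocabulary.
resolved = [{"T1059", "T1105"}, set(), {"T1566"}, {"T1027", "T1204"}]
vocab = ["T1027", "T1059", "T1105", "T1204", "T1566"]
ceiling = simulated_ceiling(resolved, vocab, p_drop=0.3, p_add=0.3)
```

In practice the noise rates would need to be estimated from the pre-resolution annotator disagreements rather than chosen by hand, and the vocabulary would cover all 114 techniques.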
Simulated Author's Rebuttal
We thank the referee for the constructive comment on annotation reliability and its implications for interpreting our results. We address the major comment point-by-point below.
Referee: [Dataset Construction / Annotation section] The central claim that micro-F1 = 0.22 demonstrates that open-source LLMs are insufficient for production-grade classification rests on the 2,076 human labels being a low-noise proxy for the true techniques. With kappa = 0.68 on a 114-class multi-label task, even modest per-sentence disagreement can inject label noise sufficient to cap observable F1 near the reported value. The manuscript should either (a) quantify the effective upper bound on F1 given the observed agreement (e.g., via label-flip simulation on the resolved annotations) or (b) qualify the interpretation to acknowledge that part of the performance gap may be attributable to annotation ambiguity rather than model limitations.
Authors: We agree that kappa = 0.68 on this complex multi-label task implies some label noise that could contribute to the observed F1 ceiling, and that our central claim would benefit from qualification. Our six-phase annotation process (including independent annotation, conflict resolution, and final review) was intended to minimize ambiguity, yet we acknowledge that residual disagreement remains a factor. We will revise the manuscript to adopt option (b): we will add explicit language in the Discussion section qualifying the interpretation to note that part of the performance gap may be attributable to annotation ambiguity rather than model limitations alone. This provides a balanced view of the results without overstating the evidence for model insufficiency.
Revision: yes
Circularity Check
No circularity: pure empirical measurement on external human annotations
The paper constructs an external ground-truth dataset via six-phase human annotation of 83 CTI reports (2,076 sentences, kappa=0.68) and reports micro-F1 scores of open-source LLMs against those fixed labels. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear. The reported F1=0.22 is a direct statistical comparison to the independently produced annotations and does not reduce to any quantity defined inside the paper itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The six-phase annotation process on sentences from 83 CTI reports yields sufficiently accurate labels for benchmarking (kappa = 0.68).