Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports

Ahmed Ryan; Md Erfan; Md Rayhanur Rahman; Saad Sakib Noor; Shaswata Mitra; Sudip Mittal

arxiv: 2606.18166 · v1 · pith:EFK2T6NEnew · submitted 2026-06-16 · 💻 cs.CR · cs.LG

Pith reviewed 2026-06-26 23:59 UTC · model grok-4.3

keywords: LLM evaluation · ATT&CK classification · Cyber Threat Intelligence · Multi-label classification · CTI reports · Open-source models · Technique labeling

The pith

Open-source LLMs reach only 0.22 micro F1 on multi-label ATT&CK classification of real CTI reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a dataset of 2,076 sentences drawn from 83 complex unstructured CTI reports and maps them via human annotation to 114 ATT&CK techniques. It then runs seven open-source LLMs on the task of assigning multiple technique labels to each sentence. The strongest result is a micro-averaged F1 of 0.22. Model size correlates positively with performance, while prompt strategy and temperature produce no statistically significant gains. This supplies the first empirical baseline for the harder, more realistic version of the problem and indicates that current models remain below production thresholds.

Core claim

Using a ground-truth dataset of 2,076 human-annotated sentences from 83 complex unstructured CTI reports mapped to 114 unique ATT&CK techniques, evaluation of seven open-source LLMs shows a highest micro-averaged F1 score of 0.22, establishing an empirical baseline and indicating that current open-source LLMs are insufficient for production-grade multi-label ATT&CK classification on complex CTI.
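For readers less familiar with the metric, micro-averaged F1 pools true positives, false positives, and false negatives over every sentence and label before computing precision and recall. The sketch below is a minimal illustration, not the paper's evaluation code; the function name `micro_prf` and the three toy sentences with their technique IDs are assumptions made for the example.

```python
from typing import Iterable, Set, Tuple


def micro_prf(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-sentence label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and annotated
        fp += len(p - g)   # labels predicted but not annotated
        fn += len(g - p)   # labels annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (hypothetical): one multi-label sentence, one negative sentence,
# and one single-label sentence.
gold = [{"T1059", "T1105"}, set(), {"T1566"}]
pred = [{"T1059"}, {"T1027"}, {"T1566", "T1204"}]

precision, recall, f1 = micro_prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that a technique-negative sentence contributes nothing when the model predicts no labels for it, but every spurious label on such a sentence counts as a false positive. With 795 negative sentences in the dataset, over-prediction is penalised directly.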

What carries the argument

A six-phase human-annotated dataset of 2,076 sentences from real CTI reports, each carrying multi-label ATT&CK technique assignments, used as the test bed for LLM performance across parameter sizes, prompts, and temperatures.
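To score free-text model output against multi-label annotations, a response first has to be reduced to a set of technique IDs. A minimal sketch of that step follows, assuming the model mentions IDs in the standard ATT&CK form (`T` plus four digits, optionally followed by a three-digit sub-technique suffix). The helper name, the sample reply, and the choice to collapse sub-techniques to their parent technique are illustrative assumptions; this page does not say how the paper parsed responses.

```python
import re
from typing import Set

# ATT&CK technique IDs look like T1059; sub-techniques look like T1059.001.
TECHNIQUE_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")


def extract_techniques(response: str, keep_subtechniques: bool = False) -> Set[str]:
    """Return the set of technique IDs mentioned in a model response."""
    ids = set()
    for match in TECHNIQUE_ID.findall(response):
        # Assumption: by default, map a sub-technique to its parent technique.
        ids.add(match if keep_subtechniques else match.split(".")[0])
    return ids


# Hypothetical model reply, for illustration only.
reply = "The sentence describes T1059.001 (PowerShell) and T1105. No other technique applies."
labels = extract_techniques(reply)
print(sorted(labels))
```

Returning a set rather than a list means repeated mentions of the same technique in one response are counted once, which matches how per-sentence label sets are usually compared.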

If this is right

Parameter count correlates positively with F1 score across the tested models.
Neither prompt strategy nor temperature setting yields statistically significant gains.
The 0.22 micro F1 result marks the performance level current open-source LLMs achieve on realistic multi-label ATT&CK tasks.
The released dataset and benchmark supply a fixed reference point for measuring future progress.
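The size-performance relationship in the first point can be checked with a rank correlation. The sketch below implements Spearman's rho in plain Python as one plausible choice; this summary does not state which test the paper used, so treat that as an assumption. The parameter counts and F1 values are invented placeholders spanning the 8B to 236B range. Only the maximum of 0.22 matches a reported number, and no significance test is computed.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# PLACEHOLDER values, not results from the paper.
params_billion = [8, 12, 20, 27, 70, 120, 236]
f1_placeholder = [0.05, 0.09, 0.08, 0.12, 0.15, 0.19, 0.22]

rho = spearman_rho(params_billion, f1_placeholder)
print(f"Spearman rho = {rho:.3f}")
```

A rank correlation is a natural fit here because parameter counts span more than an order of magnitude, so a linear relationship between raw size and F1 is not expected.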

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improving annotation agreement beyond kappa 0.68 could raise the measurable ceiling for any automated system.
Hybrid approaches that combine LLMs with rule-based or retrieval components might exceed the reported baseline without requiring larger models.
The gap between this result and production needs points to a requirement for domain-specific fine-tuning or architectural changes tailored to attack-pattern language.

Load-bearing premise

The six-phase annotation process on 83 reports produces labels accurate and representative enough of real-world CTI complexity to serve as ground truth.

What would settle it

An independent run on a fresh collection of complex CTI reports, using the same annotation protocol, in which any open-source LLM exceeds 0.30 micro F1 would falsify the claim that 0.22 represents the current ceiling.

Original abstract

Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but historically required extensive human effort. Pre-Large Language Model (LLM) automation sped up this process, but could not resolve the complex language and multi-step attack patterns found in unstructured CTI reports. LLMs addressed previous limitations by using contextual reasoning to understand unstructured text. However, current evaluations rely on simplified, single-technique sentences that ignore the complexity of real-world CTI reports, which often leads to inflated performance results. Consequently, the baseline performance of open-source LLMs on complex unstructured CTI reports remains unevaluated. To address this gap, we constructed a ground-truth dataset of 2,076 human-annotated sentences (1,281 technique-positive, 795 negative) from 83 complex unstructured CTI reports. These sentences were mapped to 114 unique ATT&CK techniques using a six-phase annotation process, achieving κ = 0.68 inter-annotator agreement. Using this dataset, we evaluated seven open-source LLMs ranging from 8B to 236B parameters across prompt strategy and temperature configurations. The highest-performing LLM achieved a micro-averaged F1 score of 0.22, establishing the empirical baseline for multi-label ATT&CK classification on complex unstructured CTI. Parameter size showed a statistically significant positive correlation with F1 score. Prompt strategy and temperature produced no statistically significant gains across model configurations. These results indicate that current open-source LLMs are insufficient for production-grade ATT&CK classification. The dataset, benchmark, and findings provide a reproducible foundation for future CTI research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper sets a needed empirical floor at 0.22 micro F1 for open-source LLMs on full unstructured CTI reports with multi-label ATT&CK tagging.

The core result is that the best open-source model reaches only 0.22 micro F1 on 2,076 sentences drawn from 83 real CTI reports and mapped to 114 techniques. That number is new because earlier work evaluated on simplified, single-technique sentences that hid the actual difficulty.

The work is straightforward and useful on its own terms. The authors built the dataset with a documented six-phase process, measured inter-annotator kappa at 0.68, ran seven models from 8B to 236B parameters, tested prompt and temperature variants, and applied statistical checks for correlation with size. They release the data and benchmark, which makes the baseline reproducible. Parameter count tracks with performance while prompt engineering does not, and the paper states the limitation plainly.

The soft spot is the moderate kappa on a 114-class multi-label task. Even small per-sentence disagreements can create an effective ceiling near the reported F1, so it is hard to separate model weakness from label noise. The 83 reports are also a narrow slice; broader coverage would strengthen the claim that current models are insufficient for production use. Those issues are acknowledged but still cap how far the conclusion can be pushed.
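For context on the agreement statistic, Cohen's kappa corrects raw agreement between two annotators for the agreement expected by chance. The sketch below computes it for the simple case of one categorical label per item. How the paper aggregated agreement over multi-label annotations is not stated on this page, so the single-label setup, the function name, and the toy annotations are all assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who each give one label per item."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # if each labelled independently according to their own label frequencies.
    expected = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical); "none" marks a technique-negative sentence.
annotator_1 = ["T1059", "T1059", "none", "T1105", "none", "T1566"]
annotator_2 = ["T1059", "T1105", "none", "T1105", "none", "none"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on four of six items (0.667 raw agreement), yet kappa is lower because some of that agreement would occur by chance.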

This is for researchers building CTI pipelines or running LLM evaluations in security. It is not a breakthrough but supplies a concrete reference point that was missing. The evidence is empirical and the gaps are minor enough that the paper deserves a serious referee rather than a desk reject.

Referee Report

1 major / 0 minor

Summary. The paper constructs a ground-truth dataset of 2,076 human-annotated sentences (1,281 technique-positive) drawn from 83 complex unstructured CTI reports. These are mapped to 114 ATT&CK techniques via a six-phase annotation process yielding kappa = 0.68. Seven open-source LLMs (8B–236B parameters) are evaluated on multi-label classification under varied prompt strategies and temperatures; the best model reaches micro-averaged F1 = 0.22. The authors conclude that current open-source LLMs remain insufficient for production-grade ATT&CK classification on real CTI, while reporting a statistically significant positive correlation between parameter count and F1 and no significant effect from prompt or temperature.

Significance. If the annotations constitute reliable ground truth, the work supplies the first empirical baseline for LLM performance on realistic multi-label ATT&CK classification from complex CTI reports. Prior evaluations on simplified single-technique sentences likely overstated capabilities; this dataset therefore supplies a reproducible foundation for measuring progress toward production use. The size–performance correlation and null results for prompting/temperature are also actionable for the community.

major comments (1)

Dataset Construction / Annotation section: The central claim that micro-F1 = 0.22 demonstrates that open-source LLMs are insufficient for production-grade classification rests on the 2,076 human labels being a low-noise proxy for the true techniques. With kappa = 0.68 on a 114-class multi-label task, even modest per-sentence disagreement can inject label noise sufficient to cap observable F1 near the reported value. The manuscript should either (a) quantify the effective upper bound on F1 given the observed agreement (e.g., via label-flip simulation on the resolved annotations) or (b) qualify the interpretation to acknowledge that part of the performance gap may be attributable to annotation ambiguity rather than model limitations.
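Option (a) could be approached along the following lines. This is a minimal sketch of one possible label-flip simulation, not a procedure taken from the paper: it perturbs the resolved labels with an assumed drop probability and an assumed spurious-label probability, then measures micro F1 between the perturbed and original labels. The noise model, the rates, the toy data, and all function names are hypothetical, and no attempt is made to calibrate the rates to kappa = 0.68.

```python
import random
from typing import List, Sequence, Set


def micro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Micro-averaged F1 over per-sentence label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def perturb(gold: Sequence[Set[str]], label_space: Sequence[str],
            p_drop: float, p_add: float, rng: random.Random) -> List[Set[str]]:
    """Simulate a second, noisy annotator (assumed noise model)."""
    noisy = []
    for labels in gold:
        # sorted() keeps the random draws reproducible across runs.
        kept = {l for l in sorted(labels) if rng.random() >= p_drop}
        if rng.random() < p_add:
            # May re-add a correct label; a deliberate simplification.
            kept.add(rng.choice(label_space))
        noisy.append(kept)
    return noisy


def simulated_ceiling(gold: Sequence[Set[str]], label_space: Sequence[str],
                      p_drop: float, p_add: float,
                      trials: int = 200, seed: int = 0) -> float:
    """Mean micro F1 of noisy re-annotations against the resolved labels."""
    rng = random.Random(seed)
    scores = [micro_f1(gold, perturb(gold, label_space, p_drop, p_add, rng))
              for _ in range(trials)]
    return sum(scores) / len(scores)


# Toy resolved annotations and label space (hypothetical).
label_space = ["T1027", "T1059", "T1105", "T1204", "T1566"]
gold = [{"T1059", "T1105"}, set(), {"T1566"}, {"T1027", "T1204"}, set()]

clean = simulated_ceiling(gold, label_space, p_drop=0.0, p_add=0.0)
noisy = simulated_ceiling(gold, label_space, p_drop=0.2, p_add=0.1)
print(f"no noise: {clean:.3f}  with noise: {noisy:.3f}")
```

Sweeping `p_drop` and `p_add` over a grid and reporting the resulting F1 range would show how much of the gap between 0.22 and a perfect score could plausibly come from annotation noise under this noise model.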

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on annotation reliability and its implications for interpreting our results. We address the major comment point-by-point below.

Referee: [Dataset Construction / Annotation section] The central claim that micro-F1 = 0.22 demonstrates that open-source LLMs are insufficient for production-grade classification rests on the 2,076 human labels being a low-noise proxy for the true techniques. With kappa = 0.68 on a 114-class multi-label task, even modest per-sentence disagreement can inject label noise sufficient to cap observable F1 near the reported value. The manuscript should either (a) quantify the effective upper bound on F1 given the observed agreement (e.g., via label-flip simulation on the resolved annotations) or (b) qualify the interpretation to acknowledge that part of the performance gap may be attributable to annotation ambiguity rather than model limitations.

Authors: We agree that kappa = 0.68 on this complex multi-label task implies some label noise that could contribute to the observed F1 ceiling, and that our central claim would benefit from qualification. Our six-phase annotation process (including independent annotation, conflict resolution, and final review) was intended to minimize ambiguity, yet we acknowledge that residual disagreement remains a factor. We will revise the manuscript to adopt option (b): we will add explicit language in the Discussion section qualifying the interpretation to note that part of the performance gap may be attributable to annotation ambiguity rather than model limitations alone. This provides a balanced view of the results without overstating the evidence for model insufficiency.

Revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical measurement on external human annotations

The paper constructs an external ground-truth dataset via six-phase human annotation of 83 CTI reports (2,076 sentences, kappa=0.68) and reports micro-F1 scores of open-source LLMs against those fixed labels. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear. The reported F1=0.22 is a direct statistical comparison to the independently produced annotations and does not reduce to any quantity defined inside the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claim rests on the assumption that the human annotations constitute reliable ground truth and that the 83 reports capture typical CTI complexity; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The six-phase annotation process on sentences from 83 CTI reports yields sufficiently accurate labels for benchmarking (kappa = 0.68).
Invoked to treat the 2,076 sentences as ground truth; moderate agreement leaves room for label noise that could affect the reported F1.


[1] [1]

Estimated cost of cybercrime worldwide 2018–2029,

Statista Research Department, “Estimated cost of cybercrime worldwide 2018–2029,” https://www.statista.com/forecasts/1280009/ cost-cybercrime-worldwide, 2026

arXiv 2018

[2] [2]

MITRE ATT&CK: Design and philosophy,

B. E. Stromet al., “MITRE ATT&CK: Design and philosophy,” The MITRE Corporation, Tech. Rep. MP180360R1, 2018. [Online]. Available: https://attack.mitre.org/docs/ATTACK_Design_ and_Philosophy_March_2020.pdf

2018

[3] [3]

Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains,

E. M. Hutchinset al., “Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains,” Lockheed Martin Corporation, White Paper, 2011. [Online]. Avail- able: https://www.lockheedmartin.com/content/dam/lockheed-martin/ rms/documents/cyber/LM-White-Paper-Intel-Driven-Defense.pdf

2011

[4] [4]

The diamond model of intrusion analysis,

S. Caltagironeet al., “The diamond model of intrusion analysis,” Center for Cyber Threat Intelligence and Threat Research, Technical Report ADA586960, 2013. [Online]. Available: https://apps.dtic.mil/sti/ citations/ADA586960

2013

[5] [5]

The vocabulary for event recording and incident sharing (VERIS) framework,

Verizon Risk Team, “The vocabulary for event recording and incident sharing (VERIS) framework,” https://verisframework.org, 2010

2010

[6] [6]

Research on discovery and mapping of ATT&CK tactics and techniques by cyber threat intelligence based on BERT-TextCNN,

P. Wanget al., “Research on discovery and mapping of ATT&CK tactics and techniques by cyber threat intelligence based on BERT-TextCNN,” IEEE Access, 2026

2026

[7] [7]

Rule-ATT&CK mapper (RAM): Mapping SIEM rules to TTPs using LLMs,

P. N. Wudaliet al., “Rule-ATT&CK mapper (RAM): Mapping SIEM rules to TTPs using LLMs,”arXiv preprint arXiv:2502.02337, 2025

arXiv 2025

[8] [8]

From threat reports to continuous threat intelligence: A comparison of attack technique extraction methods from textual artifacts,

M. R. Rahman and L. Williams, “From threat reports to continuous threat intelligence: A comparison of attack technique extraction methods from textual artifacts,” 2022

2022

[9] [9]

What are the attackers doing now? automating cyberthreat intelligence extraction from text on pace with the changing threat landscape: A survey,

M. R. Rahmanet al., “What are the attackers doing now? automating cyberthreat intelligence extraction from text on pace with the changing threat landscape: A survey,”ACM Comput. Surv., vol. 55, no. 12, pp. 1–36, 2023

2023

[10] [10]

A survey of large language models,

W. X. Zhaoet al., “A survey of large language models,” 2023

2023

[11] [11]

CTIBench: A benchmark for evaluating LLMs in cyber threat intelligence,

M. T. Alamet al., “CTIBench: A benchmark for evaluating LLMs in cyber threat intelligence,” inProc. NeurIPS 2024, Datasets and Benchmarks Track. Curran Associates, Inc., 2024, pp. 50 805–50 825

2024

[12] [12]

TRAM: Threat report ATT&CK mapper,

Center for Threat-Informed Defense, “TRAM: Threat report ATT&CK mapper,” https://github.com/center-for-threat-informed-defense/tram, 2020

2020

[13] [13]

Hierarchical RAG for adversarial technique annota- tion,

F. Morbiatoet al., “Hierarchical RAG for adversarial technique annota- tion,” 2026

2026

[14] [14]

What are adversaries doing? automating tactics, techniques, and procedures extraction: A systematic review,

M. Tamannaet al., “What are adversaries doing? automating tactics, techniques, and procedures extraction: A systematic review,” 2026

2026

[15] [15]

Chain-of-thought prompting elicits reasoning in large language models,

J. Weiet al., “Chain-of-thought prompting elicits reasoning in large language models,” inProc. NeurIPS 2022. Curran Associates, Inc., 2022, pp. 24 824–24 837

2022

[16] [16]

Language models are few-shot learners,

T. B. Brownet al., “Language models are few-shot learners,” inProc. NeurIPS 2020. Curran Associates, Inc., 2020, pp. 1877–1901

2020

[17] [17]

D. C. Montgomery,Design and Analysis of Experiments, 10th ed. Hoboken, NJ: John Wiley & Sons, 2020

2020

[18] [18]

Retrieval-augmented generation for knowledge-intensive NLP tasks,

P. Lewiset al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” inProc. NeurIPS 2020. Curran Associates, Inc., 2020, pp. 9459–9474

2020

[19] [19]

LLMCloudHunter: Harnessing LLMs for automated extraction of detection rules from cloud-based CTI,

Y . Schwartzet al., “LLMCloudHunter: Harnessing LLMs for automated extraction of detection rules from cloud-based CTI,” inCompanion Proc. ACM WWW 2025. ACM, 2025

2025

[20] [20]

Automated retrieval of ATT&CK tactics and techniques for cyber threat reports,

V . Legoyet al., “Automated retrieval of ATT&CK tactics and techniques for cyber threat reports,”arXiv preprint arXiv:2004.14322, 2020

arXiv 2004

[21] [21]

CTI-HAL: A human-annotated dataset for cyber threat intelligence analysis,

S. Della Pennaet al., “CTI-HAL: A human-annotated dataset for cyber threat intelligence analysis,” 2025

2025

[22] [22]

TTPDrill: Automatic and accurate extraction of threat actions from unstructured text of CTI sources,

G. Husariet al., “TTPDrill: Automatic and accurate extraction of threat actions from unstructured text of CTI sources,” inProc. ACSAC 2017. ACM, 2017, pp. 103–115

2017

[23] [23]

AttacKG: Constructing technique knowledge graph from cyber threat intelligence reports,

Z. Liet al., “AttacKG: Constructing technique knowledge graph from cyber threat intelligence reports,” inProc. ESORICS 2022. Springer, 2022, pp. 589–609

2022

[24] [24]

SMET: Semantic mapping of CVE to ATT&CK and its application to cybersecurity,

B. Abdeenet al., “SMET: Semantic mapping of CVE to ATT&CK and its application to cybersecurity,” inProc. DBSec 2023. Springer, 2023, pp. 243–260

2023

[25] [25]

SecureBERT: A domain-specific language model for cybersecurity,

E. Aghaeiet al., “SecureBERT: A domain-specific language model for cybersecurity,” inProc. SecureComm 2022. Springer, 2023, pp. 39–56

2022

[26] [26]

FALCON: Autonomous cyber threat intelligence mining with LLMs for IDS rule generation,

S. Mitraet al., “FALCON: Autonomous cyber threat intelligence mining with LLMs for IDS rule generation,” 2025

2025

[27] [27]

Large language models are unreliable for cyber threat intelligence,

E. Mezzi et al., “Large language models are unreliable for cyber threat intelligence,” 2025

2025

[28] [28]

The DFIR report — real intrusions by real attackers,

The DFIR Report, “The DFIR report — real intrusions by real attackers,” https://thedfirreport.com, 2024

2024

[29] [29]

Scrapy: An open source and collaborative framework for extracting data from websites,

Zyte and Scrapy Developers, “Scrapy: An open source and collaborative framework for extracting data from websites,” https://scrapy.org, 2024

2024

[30] [30]

Beautiful Soup: A python library for pulling data out of HTML and XML files,

L. Richardson, “Beautiful Soup: A python library for pulling data out of HTML and XML files,” https://www.crummy.com/software/BeautifulSoup/, 2024

2024

[31] [31]

Newspaper3k: Article scraping and curation,

L. Ou-Yang, “Newspaper3k: Article scraping and curation,” https://github.com/codelucas/newspaper, 2024

2024

[32] [32]

A coefficient of agreement for nominal scales,

J. Cohen, “A coefficient of agreement for nominal scales,” Educ. Psychol. Meas., vol. 20, no. 1, pp. 37–46, 1960

1960

[33] [33]

The measurement of observer agreement for categorical data,

J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, 1977

1977

[34] [34]

The curious case of neural text degeneration,

A. Holtzman et al., “The curious case of neural text degeneration,” in Proc. ICLR 2020, 2020. [Online]. Available: https://openreview.net/forum?id=rygGQyrFvH

2020

[35] [35]

Scaling laws for neural language models,

J. Kaplan et al., “Scaling laws for neural language models,” 2020

2020

[36] [36]

The Llama 3 herd of models,

A. Dubey et al., “The Llama 3 herd of models,” 2024

2024

[37] [37]

Gemma 2: Improving open language models at a practical size,

Gemma Team et al., “Gemma 2: Improving open language models at a practical size,” 2024

2024

[38] [38]

DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model,

DeepSeek-AI, “DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model,” 2024

2024

[39] [39]

DeepSeek-V2.5 model repository,

——, “DeepSeek-V2.5 model repository,” https://huggingface.co/deepseek-ai/DeepSeek-V2.5, 2024

2024

[40] [40]

GPT-OSS-120B model repository,

OpenAI, “GPT-OSS-120B model repository,” https://huggingface.co/openai/gpt-oss-120b, 2024

2024

[41] [41]

Meta-Llama-3.1-70B-Instruct model repository,

Meta AI, “Meta-Llama-3.1-70B-Instruct model repository,” https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct, 2024

2024

[42] [42]

Gemma-3-27B model repository,

Google DeepMind, “Gemma-3-27B model repository,” https://huggingface.co/google/gemma-3-27b-it, 2024

2024

[43] [43]

GPT-OSS-20B model repository,

OpenAI, “GPT-OSS-20B model repository,” https://huggingface.co/openai/gpt-oss-20b, 2024

2024

[44] [44]

Gemma-3-12B model repository,

Google DeepMind, “Gemma-3-12B model repository,” https://huggingface.co/google/gemma-3-12b-it, 2024

2024

[45] [45]

Meta-Llama-3.1-8B-Instruct model repository,

Meta AI, “Meta-Llama-3.1-8B-Instruct model repository,” https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct, 2024

2024

[46] [46]

Faiss: A library for efficient similarity search and clustering of dense vectors,

Meta Platforms, Inc., “Faiss: A library for efficient similarity search and clustering of dense vectors,” https://faiss.ai, 2024

2024

[47] [47]

The proof and measurement of association between two things,

C. Spearman, “The proof and measurement of association between two things,” Am. J. Psychol., vol. 15, no. 1, pp. 72–101, 1904

1904

[48] [48]

TTPHunter: Automated extraction of actionable intelligence as TTPs from narrative threat reports,

N. Rani et al., “TTPHunter: Automated extraction of actionable intelligence as TTPs from narrative threat reports,” in Proc. ACSW 2023. ACM, 2023, pp. 126–134

2023

[49] [49]

TTPXHunter: Actionable threat intelligence extraction as TTPs from finished cyber threat reports,

——, “TTPXHunter: Actionable threat intelligence extraction as TTPs from finished cyber threat reports,” Digit. Threats Res. Pract., vol. 5, no. 4, pp. 1–19, 2024

2024

[50] [50]

SoK: Automated TTP extraction from CTI reports,

M. Büchel et al., “SoK: Automated TTP extraction from CTI reports,” in Proc. USENIX Security 2025. USENIX Association, 2025, pp. 4621–4641

2025

[51] [51]

Towards effective identification of attack techniques in cyber threat intelligence reports using large language models,

H. C. Nguyen et al., “Towards effective identification of attack techniques in cyber threat intelligence reports using large language models,” 2025

2025

[52] [52]

Beyond single reports: Evaluating automated ATT&CK technique extraction in multi-report campaign settings,

M. N. Haque et al., “Beyond single reports: Evaluating automated ATT&CK technique extraction in multi-report campaign settings,” in Proc. ASE 2026. IEEE/ACM, 2026

2026

[53] [53]

Advancing TTP analysis: Harnessing the power of large language models with retrieval augmented generation,

R. Fayyazi et al., “Advancing TTP analysis: Harnessing the power of large language models with retrieval augmented generation,” 2024

2024

[54] [54]

Learning and evaluation in the presence of class hierarchies: Application to text categorization,

S. Kiritchenko et al., “Learning and evaluation in the presence of class hierarchies: Application to text categorization,” in Proc. Canadian AI 2006. Springer, 2006, pp. 395–406

2006

[56] [56]

Evaluation measures for hierarchical classification: A unified view and novel approaches,

A. Kosmopoulos et al., “Evaluation measures for hierarchical classification: A unified view and novel approaches,” Data Min. Knowl. Discov., vol. 29, no. 3, pp. 820–865, 2015

2015

[57] [57]

CoPHE: A count-preserving hierarchical evaluation metric in large-scale multi-label text classification,

M. Falis et al., “CoPHE: A count-preserving hierarchical evaluation metric in large-scale multi-label text classification,” in Proc. EMNLP 2021. Association for Computational Linguistics, 2021, pp. 907–912

2021

[59] [59]

Mining temporal attack patterns from cyberthreat intelligence reports,

M. R. Rahman et al., “Mining temporal attack patterns from cyberthreat intelligence reports,” Knowl. Inf. Syst., vol. 67, no. 10, pp. 8941–8981, 2025

2025

[60] [60]

ChronoCTI: Mining knowledge graph of temporal relations among cyberattack actions,

——, “ChronoCTI: Mining knowledge graph of temporal relations among cyberattack actions,” in Proc. IEEE ICDM 2024. IEEE, 2024, pp. 420–429

2024