arxiv: 2606.18166 · v1 · pith:EFK2T6NEnew · submitted 2026-06-16 · 💻 cs.CR · cs.LG

Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports

Pith reviewed 2026-06-26 23:59 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords LLM evaluation · ATT&CK classification · Cyber Threat Intelligence · Multi-label classification · CTI reports · Open-source models · Technique labeling

The pith

Open-source LLMs reach only 0.22 micro F1 on multi-label ATT&CK classification of real CTI reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a dataset of 2,076 sentences drawn from 83 complex unstructured CTI reports and maps them, through human annotation, to 114 ATT&CK techniques. It then runs seven open-source LLMs on the task of assigning multiple technique labels to each sentence. The strongest result is a micro-averaged F1 of 0.22. Model size correlates positively with performance, while prompt choice and temperature produce no statistically significant gains. This supplies the first empirical baseline for the harder, more realistic version of the problem and indicates that current models remain below production thresholds.

Core claim

Using a ground-truth dataset of 2,076 human-annotated sentences from 83 complex unstructured CTI reports mapped to 114 unique ATT&CK techniques, evaluation of seven open-source LLMs shows a highest micro-averaged F1 score of 0.22, establishing an empirical baseline and indicating that current open-source LLMs are insufficient for production-grade multi-label ATT&CK classification on complex CTI.
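To make the headline metric concrete, here is a minimal sketch of micro-averaged F1 for multi-label predictions, where each sentence carries a set of technique labels (possibly empty). The function name `micro_prf` and the toy sentences are illustrative assumptions; the paper's own scoring code is not shown on this page, so details such as how sub-techniques are matched may differ.

```python
from typing import Sequence, Set, Tuple


def micro_prf(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-sentence label sets.

    Counts are pooled across all sentences before the ratios are taken,
    so frequent techniques weigh more than rare ones.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of sentences")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in the reference
        fp += len(p - g)   # labels predicted but absent from the reference
        fn += len(g - p)   # reference labels the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example (hypothetical sentences; IDs are used only as example labels).
gold = [{"T1059", "T1105"}, set(), {"T1566"}]
pred = [{"T1059"}, {"T1003"}, {"T1566", "T1204"}]
precision, recall, f1 = micro_prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that a technique-negative sentence (empty reference set) can only contribute false positives, so a model that over-labels negative sentences is penalized on precision.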

What carries the argument

A six-phase human-annotated dataset of 2,076 sentences from real CTI reports, each carrying multi-label ATT&CK technique assignments, used as the test bed for LLM performance across parameter sizes, prompts, and temperatures.

If this is right

  • Parameter count correlates positively with F1 score across the tested models.
  • Neither prompt strategy nor temperature setting yields statistically significant gains.
  • The 0.22 micro F1 result marks the performance level current open-source LLMs achieve on realistic multi-label ATT&CK tasks.
  • The released dataset and benchmark supply a fixed reference point for measuring future progress.
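The size-performance claim in the first bullet is a rank-correlation statement, which can be sketched as follows. This is a plain-Python Spearman rho, not the paper's analysis script. Only the 8B and 236B endpoints come from the abstract; the intermediate sizes and every F1 value below are made-up placeholders chosen purely to exercise the code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# PLACEHOLDER data: not results from the paper.
params_billion = [8, 12, 20, 27, 70, 120, 236]
placeholder_f1 = [0.05, 0.08, 0.07, 0.11, 0.15, 0.18, 0.22]
rho = spearman_rho(params_billion, placeholder_f1)
print(f"Spearman rho = {rho:.3f}")
```

With only seven models, a significance test on rho has limited power, so the sketch reports the coefficient alone and leaves the p-value to a statistics library.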

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improving annotation agreement beyond kappa 0.68 could raise the measurable ceiling for any automated system.
  • Hybrid approaches that combine LLMs with rule-based or retrieval components might exceed the reported baseline without requiring larger models.
  • The gap between this result and production needs points to a requirement for domain-specific fine-tuning or architectural changes tailored to attack-pattern language.

Load-bearing premise

The six-phase annotation process on 83 reports produces labels accurate and representative enough of real-world CTI complexity to serve as ground truth.

What would settle it

An independent run on a fresh collection of complex CTI reports, using the same annotation protocol, in which any open-source LLM exceeds 0.30 micro F1 would falsify the claim that 0.22 represents the current ceiling.

Original abstract

Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but historically required extensive human effort. Pre-Large Language Model (LLM) automation sped up this process, but could not resolve the complex language and multi-step attack patterns found in unstructured CTI reports. LLMs addressed previous limitations by using contextual reasoning to understand unstructured text. However, current evaluations rely on simplified, single-technique sentences that ignore the complexity of real-world CTI reports, which often leads to inflated performance results. Consequently, the baseline performance of open-source LLMs on complex unstructured CTI reports remains unevaluated. To address this gap, we constructed a ground-truth dataset of 2,076 human-annotated sentences (1,281 technique-positive, 795 negative) from 83 complex unstructured CTI reports. These sentences were mapped to 114 unique ATT&CK techniques using a six-phase annotation process, achieving κ = 0.68 inter-annotator agreement. Using this dataset, we evaluated seven open-source LLMs ranging from 8B to 236B parameters across prompt strategy and temperature configurations. The highest-performing LLM achieved a micro-averaged F1 score of 0.22, establishing the empirical baseline for multi-label ATT&CK classification on complex unstructured CTI. Parameter size showed a statistically significant positive correlation with F1 score. Prompt strategy and temperature produced no statistically significant gains across model configurations. These results indicate that current open-source LLMs are insufficient for production-grade ATT&CK classification. The dataset, benchmark, and findings provide a reproducible foundation for future CTI research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper constructs a ground-truth dataset of 2,076 human-annotated sentences (1,281 technique-positive) drawn from 83 complex unstructured CTI reports. These are mapped to 114 ATT&CK techniques via a six-phase annotation process yielding kappa = 0.68. Seven open-source LLMs (8B–236B parameters) are evaluated on multi-label classification under varied prompt strategies and temperatures; the best model reaches micro-averaged F1 = 0.22. The authors conclude that current open-source LLMs remain insufficient for production-grade ATT&CK classification on real CTI, while reporting a statistically significant positive correlation between parameter count and F1 and no significant effect from prompt or temperature.

Significance. If the annotations constitute reliable ground truth, the work supplies the first empirical baseline for LLM performance on realistic multi-label ATT&CK classification from complex CTI reports. Prior evaluations on simplified single-technique sentences likely overstated capabilities; this dataset therefore supplies a reproducible foundation for measuring progress toward production use. The size–performance correlation and null results for prompting/temperature are also actionable for the community.

major comments (1)
  1. [Dataset Construction / Annotation section] The central claim that micro-F1 = 0.22 demonstrates that open-source LLMs are insufficient for production-grade classification rests on the 2,076 human labels being a low-noise proxy for the true techniques. With kappa = 0.68 on a 114-class multi-label task, even modest per-sentence disagreement can inject label noise sufficient to cap observable F1 near the reported value. The manuscript should either (a) quantify the effective upper bound on F1 given the observed agreement (e.g., via label-flip simulation on the resolved annotations) or (b) qualify the interpretation to acknowledge that part of the performance gap may be attributable to annotation ambiguity rather than model limitations.
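Option (a) in the comment above could be approached along these lines. The sketch scores a hypothetical perfect model (one that outputs the true labels) against a reference corrupted by random label flips, and averages micro-F1 over repeated trials. Everything here is an assumption for illustration: the noise model, the `drop_prob` and `add_prob` rates, the toy label sets, and the placeholder label names. In particular, the flip rates are free parameters and are not derived from kappa = 0.68; mapping observed agreement to flip rates would be a separate modeling step.

```python
import random


def micro_f1(reference, predicted):
    """Micro-averaged F1 over per-sentence label sets."""
    tp = sum(len(r & p) for r, p in zip(reference, predicted))
    fp = sum(len(p - r) for r, p in zip(reference, predicted))
    fn = sum(len(r - p) for r, p in zip(reference, predicted))
    denom = 2 * tp + fp + fn
    # If both sides are empty everywhere, treat it as perfect agreement.
    return 2 * tp / denom if denom else 1.0


def flip_labels(labels, label_space, drop_prob, add_prob, rng):
    """Return a noisy copy: drop each label with drop_prob, and with
    add_prob attach one randomly chosen label to the sentence."""
    noisy = []
    for sentence_labels in labels:
        kept = {t for t in sorted(sentence_labels) if rng.random() >= drop_prob}
        if rng.random() < add_prob:
            kept.add(rng.choice(label_space))
        noisy.append(kept)
    return noisy


def simulated_ceiling(labels, label_space, drop_prob, add_prob, trials=200, seed=0):
    """Mean micro-F1 of a perfect model scored against noisy references."""
    rng = random.Random(seed)
    scores = [
        micro_f1(flip_labels(labels, label_space, drop_prob, add_prob, rng), labels)
        for _ in range(trials)
    ]
    return sum(scores) / trials


# Toy stand-in for the resolved annotations: 114 placeholder label names
# (not real ATT&CK IDs) and 300 synthetic sentences with 0 to 3 labels each.
label_space = [f"L{i:03d}" for i in range(114)]
data_rng = random.Random(1)
toy_labels = [set(data_rng.sample(label_space, data_rng.randint(0, 3))) for _ in range(300)]

ceiling_clean = simulated_ceiling(toy_labels, label_space, drop_prob=0.0, add_prob=0.0)
ceiling_noisy = simulated_ceiling(toy_labels, label_space, drop_prob=0.2, add_prob=0.1)
print(f"ceiling without noise: {ceiling_clean:.3f}")
print(f"ceiling with assumed noise: {ceiling_noisy:.3f}")
```

Running the same procedure on the released annotations, with flip rates estimated from the pre-resolution disagreements, would indicate how much of the gap to F1 = 1.0 label noise alone could explain.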

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on annotation reliability and its implications for interpreting our results. We address the major comment point-by-point below.

  1. Referee: [Dataset Construction / Annotation section] The central claim that micro-F1 = 0.22 demonstrates that open-source LLMs are insufficient for production-grade classification rests on the 2,076 human labels being a low-noise proxy for the true techniques. With kappa = 0.68 on a 114-class multi-label task, even modest per-sentence disagreement can inject label noise sufficient to cap observable F1 near the reported value. The manuscript should either (a) quantify the effective upper bound on F1 given the observed agreement (e.g., via label-flip simulation on the resolved annotations) or (b) qualify the interpretation to acknowledge that part of the performance gap may be attributable to annotation ambiguity rather than model limitations.

    Authors: We agree that kappa = 0.68 on this complex multi-label task implies some label noise that could contribute to the observed F1 ceiling, and that our central claim would benefit from qualification. Our six-phase annotation process (including independent annotation, conflict resolution, and final review) was intended to minimize ambiguity, yet we acknowledge that residual disagreement remains a factor. We will revise the manuscript to adopt option (b): we will add explicit language in the Discussion section qualifying the interpretation to note that part of the performance gap may be attributable to annotation ambiguity rather than model limitations alone. This provides a balanced view of the results without overstating the evidence for model insufficiency.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical measurement on external human annotations

The paper constructs an external ground-truth dataset via six-phase human annotation of 83 CTI reports (2,076 sentences, kappa=0.68) and reports micro-F1 scores of open-source LLMs against those fixed labels. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear. The reported F1=0.22 is a direct statistical comparison to the independently produced annotations and does not reduce to any quantity defined inside the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claim rests on the assumption that the human annotations constitute reliable ground truth and that the 83 reports capture typical CTI complexity; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: the six-phase annotation process on sentences from 83 CTI reports yields sufficiently accurate labels for benchmarking (kappa = 0.68).
    Invoked to treat the 2,076 sentences as ground truth; substantial but imperfect agreement leaves room for label noise that could affect the reported F1.
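For readers unfamiliar with the agreement statistic behind this axiom, the following is a minimal sketch of Cohen's kappa for two annotators making one categorical decision per sentence. How the paper computes kappa over a 114-technique multi-label scheme is not stated on this page, so this single-label, two-annotator form is an assumption, and the annotator decisions below are invented toy data.

```python
from collections import Counter


def cohens_kappa(annotator_a, annotator_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(annotator_a) != len(annotator_b) or not annotator_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(annotator_a)
    observed = sum(x == y for x, y in zip(annotator_a, annotator_b)) / n
    counts_a, counts_b = Counter(annotator_a), Counter(annotator_b)
    # Chance agreement: probability both annotators pick the same category
    # if each labelled independently according to their own marginals.
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category
    return (observed - expected) / (1 - expected)


# Toy decisions: is the sentence technique-positive ("pos") or negative ("neg")?
ann_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
ann_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]
kappa = cohens_kappa(ann_a, ann_b)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on 8 of 10 sentences, yet kappa is noticeably lower than 0.8 because a good share of that agreement would be expected by chance.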
