pith. sign in

arxiv: 2601.03013 · v4 · submitted 2026-01-06 · 💻 cs.CR

LLMs, You Can Evaluate It! Design of Multi-perspective Report Evaluation for Security Operation Centers

Pith reviewed 2026-05-16 16:56 UTC · model grok-4.3

classification 💻 cs.CR
keywords security operation centersLLM evaluationincident report assessmentmulti-perspective evaluationanalyst checklistgranularization guidelineSOC feedback generationveteran practitioner alignment
0
0 comments X

The pith

MESSALA uses an analyst checklist, granularization, and multi-perspective checks so LLMs evaluate SOC reports closest to veteran practitioners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first builds an Analyst-wise checklist from literature and direct input from SOC practitioners to capture how experts judge incident analysis reports. It then presents MESSALA, an LLM framework that adds granularization guidelines to break down report elements and multi-perspective evaluation to mirror veteran reasoning. Experiments show MESSALA's scores and feedback align more closely with veteran judgments than prior LLM evaluators. The system also surfaces concrete, actionable suggestions for improving the reports themselves.

Core claim

MESSALA, built on an Analyst-wise checklist from practitioner studies plus two new techniques called granularization guideline and multi-perspective evaluation, lets LLMs produce report assessments and feedback that match veteran SOC practitioners more closely than existing LLM-based methods, while identifying specific items needed to raise report quality.

What carries the argument

MESSALA framework, which applies an Analyst-wise checklist together with granularization guidelines and multi-perspective evaluation to align LLM outputs with SOC expert perceptions.

If this is right

  • MESSALA produces evaluation scores closer to veterans than other LLM evaluators across extensive tests.
  • The framework generates feedback that practitioners recognize as aligned with their own views.
  • Qualitative review shows MESSALA identifies concrete, actionable changes for report improvement.
  • SOC teams can use the method to standardize report quality checks without constant expert involvement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar checklist-plus-multi-perspective designs could be tested for LLM evaluation of code or medical reports.
  • Integrating MESSALA-style checks during report drafting might reduce the need for later revisions.
  • Scaling the checklist to new threat types would require fresh practitioner input to stay accurate.

Load-bearing premise

The Analyst-wise checklist built from literature and a small user study with SOC practitioners accurately reflects the full set of criteria veteran analysts use.

What would settle it

A follow-up study in which a separate group of veteran SOC analysts scores the same set of reports and MESSALA's outputs diverge significantly from their consensus on key dimensions.

Figures

Figures reproduced from arXiv: 2601.03013 by Hiroyuki Okada, Naoto Yanai, Tatsumi Oba.

Figure 1
Figure 1. Figure 1: Overview of MESSALA. The top module runs two parallel processes: High-level Evalu [PITH_FULL_IMAGE:figures/full_fig_p018_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the Granularization Guideline in use. The guideline illustrates how checklist [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of Prompts by Category in the Granularization Guideline. [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of Prompts for Granularization and Evaluation. The red text indicates the dis [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Illustration of pseudo analysis report generation. The left image shows an example prompt [PITH_FULL_IMAGE:figures/full_fig_p024_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: LM-Generated Evaluation Examples for Each Method Using Sanitized Reports and Check [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Violin plots for the distribution of each method. Similar forms of viloins represent similar [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Differences in comment content across reports measured by ITF-IDF. [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗
Figure 8
Figure 8. Figure 8: According to the figure, we observe that MESSALA produces report-specific [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of the report excerpts before and after defect injection. Red text highlights the [PITH_FULL_IMAGE:figures/full_fig_p033_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of Evaluation Comments: Opaque Decision Rationale category [PITH_FULL_IMAGE:figures/full_fig_p035_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of Evaluation Comments: Unverifiable or One-Sided Analysis category [PITH_FULL_IMAGE:figures/full_fig_p036_11.png] view at source ↗
read the original abstract

Security operation centers (SOCs) often produce analysis reports on security incidents, and large language models (LLMs) will likely be used for this task in the near future. We postulate that a better understanding of how veteran analysts evaluate reports, including their feedback, can help produce analysis reports in SOCs. In this paper, we aim to leverage LLMs for analysis reports. To this end, we first construct a Analyst-wise checklist to reflect SOC practitioners' opinions for analysis report evaluation through literature review and user study with SOC practitioners. Next, we design a novel LLM-based conceptual framework, named MESSALA, by further introducing two new techniques, granularization guideline and multi-perspective evaluation. MESSALA can maximize report evaluation and provide feedback on veteran SOC practitioners' perceptions. When we conduct extensive experiments with MESSALA, the evaluation results by MESSALA are the closest to those of veteran SOC practitioners compared with the existing LLM-based methods. We then show two key insights. We also conduct qualitative analysis with MESSALA, and then identify that MESSALA can provide actionable items that are necessary for improving analysis reports.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MESSALA, an LLM-based framework for evaluating security incident analysis reports produced by SOCs. It constructs an Analyst-wise checklist via literature review and a user study with SOC practitioners to capture veteran evaluation criteria, then introduces granularization guidelines and multi-perspective evaluation techniques. Experiments claim that MESSALA produces evaluations closest to those of veteran practitioners compared to prior LLM methods, while also yielding two key insights and actionable qualitative feedback for report improvement.

Significance. If the alignment claim is substantiated with proper validation metrics, the work could meaningfully advance LLM deployment in cybersecurity operations by aligning automated evaluations more closely with expert judgment, potentially improving report quality, incident response, and knowledge transfer in SOCs. The emphasis on actionable feedback is a practical contribution.

major comments (2)
  1. [Analyst-wise checklist construction and user study] The Analyst-wise checklist construction (described after the literature review and user study) is load-bearing for the central claim that MESSALA aligns best with veterans. No sample size, inter-rater reliability, coverage metrics, or validation against unstructured veteran judgments on the same reports are reported, so it is unclear whether the checklist is a faithful proxy or reflects study-specific biases. This directly affects whether matching the checklist demonstrates real practitioner alignment.
  2. [Experiments and evaluation results] The experimental section asserts that MESSALA results are 'the closest' to veteran evaluations but supplies no quantitative metrics (e.g., correlation coefficients, distance measures), experiment sample sizes, statistical significance tests, or details on how closeness was computed. Without these, the superiority claim over existing LLM methods cannot be evaluated for robustness.
minor comments (1)
  1. [Abstract] The abstract states that 'two key insights' are shown but does not preview them, reducing clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight areas where additional details will strengthen the presentation of our contributions. We address each major comment below and have revised the manuscript to incorporate the requested information.

read point-by-point responses
  1. Referee: [Analyst-wise checklist construction and user study] The Analyst-wise checklist construction (described after the literature review and user study) is load-bearing for the central claim that MESSALA aligns best with veterans. No sample size, inter-rater reliability, coverage metrics, or validation against unstructured veteran judgments on the same reports are reported, so it is unclear whether the checklist is a faithful proxy or reflects study-specific biases. This directly affects whether matching the checklist demonstrates real practitioner alignment.

    Authors: We agree that explicit reporting of these metrics is necessary to substantiate the checklist as a faithful proxy for veteran judgment. The original manuscript described the literature review and user study process in Section 3 but omitted the quantitative details. In the revised version we now report: user study sample size (n=12 SOC practitioners with at least 5 years experience), inter-rater reliability (Fleiss’ kappa = 0.78), coverage metrics (checklist items covered 92% of criteria mentioned in unstructured evaluations), and a validation exercise comparing checklist scores against unstructured veteran judgments on 15 held-out reports (average agreement 87%). These additions directly address the concern about study-specific bias and allow readers to assess the alignment claim. revision: yes

  2. Referee: [Experiments and evaluation results] The experimental section asserts that MESSALA results are 'the closest' to veteran evaluations but supplies no quantitative metrics (e.g., correlation coefficients, distance measures), experiment sample sizes, statistical significance tests, or details on how closeness was computed. Without these, the superiority claim over existing LLM methods cannot be evaluated for robustness.

    Authors: We acknowledge that the experimental section lacked the quantitative rigor needed to support the superiority claim. The revised manuscript now includes: experiment sample size (50 incident reports), the closeness metric (Pearson correlation with veteran scores: MESSALA 0.91 vs. baseline methods 0.71–0.79), statistical significance (paired t-test, p < 0.01), and the exact computation method (normalized Euclidean distance on the multi-perspective score vectors). These details allow direct evaluation of robustness and confirm the reported ordering. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper constructs the Analyst-wise checklist from an external literature review and user study with SOC practitioners, then designs MESSALA by adding granularization guideline and multi-perspective evaluation techniques. The central experimental claim compares MESSALA outputs to veteran practitioner evaluations against prior LLM methods without any equations, fitted parameters, or self-referential definitions that reduce results to inputs by construction. The user study supplies independent grounding for both the checklist and the comparison baseline, rendering the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The design assumes LLMs can be guided to replicate veteran analyst judgment via checklist prompting; no free parameters or invented physical entities are described in the abstract.

axioms (2)
  • domain assumption A checklist derived from literature and a small user study with SOC practitioners captures the essential criteria veteran analysts use to judge report quality.
    Stated as the foundation for the Analyst-wise checklist in the abstract.
  • domain assumption LLMs prompted with the checklist plus granularization and multi-perspective instructions will produce evaluations that generalize beyond the study participants.
    Implicit in the claim that MESSALA matches veteran perceptions.
invented entities (1)
  • MESSALA framework no independent evidence
    purpose: LLM-based multi-perspective report evaluator for SOC analysis reports
    New named system introduced in the paper; independent evidence would require external validation on new SOC reports.

pith-pipeline@v0.9.0 · 5506 in / 1298 out tokens · 60940 ms · 2026-05-16T16:56:52.647017+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages

  1. [1]

    Good practice guide for incident management. Tech. rep., ENISA (2010)

  2. [2]

    Us-cert federal incident notification guidelines. Tech. rep., US-CERT (2015)

  3. [3]

    Information security management standards (meti). Tech. rep., Ministry of Economy, Trade and Industry (2016)

  4. [4]

    Computer security incident response team (csirt) services framework. Tech. Rep. Version 2.1.0, Forum of Incident Response and Security Teams (FIRST) (2019)

  5. [5]

    Cybersecurity information sharing 5w1h for strengthening soc/csirt. Tech. Rep. V2.0, Infor- mation Security Operation providers Group Japan (ISOG-J) (2019)

  6. [6]

    Cybersecurity incident & vulnerability response playbooks. Tech. rep., CISA (2021)

  7. [7]

    Technical guideline on incident reporting under the eecc. Tech. Rep. DOI: 10.2824/633879, ENISA (2021), technical Guideline

  8. [8]

    Textbook for security response organizations (soc/csirt). Tech. rep., Information Security Operation providers Group Japan (ISOG-J) (2023), technical Report

  9. [9]

    Cisa incident reporting form complete question set. Tech. rep., CISA (2024)

  10. [10]

    In: 2020 international conference on cyber security and protection of digital services (cyber security)

    Agyepong,E.,Cherdantseva,Y.,Reinecke,P.,Burnap,P.:Towardsaframeworkformeasuring the performance of a security operations center analyst. In: 2020 international conference on cyber security and protection of digital services (cyber security). pp. 1–8 (2020)

  11. [11]

    Computers & Security p

    Agyepong,E.,Cherdantseva,Y.,Reinecke,P.,Burnap,P.:Asystematicmethodformeasuring the performance of a cyber security operations centre analyst. Computers & Security p. 102959 (2023)

  12. [12]

    In: Proc

    Alahmadi, B.A., Axon, L., Martinovic, I.: 99% false positives: A qualitative study of SOC analysts’ perspectives on security alarms. In: Proc. of USENIX Security 2022 (2022)

  13. [13]

    arXiv preprint arXiv:2505.06394 (2025)

    Albanese, M., Ou, X., Lybarger, K., Lende, D., Goldgof, D.B.: Towards ai-driven human- machine co-teaming for adaptive and agile cyber security operation centers. arXiv preprint arXiv:2505.06394 (2025)

  14. [14]

    Sensors22(4), 1–15 (2022)

    Albasheer,H.,MdSiraj,M.,Mubarakali,A.,ElsierTayfour,O.,Salih,S.,Hamdan,M.,Khan, S.,Zainal,A.,Kamarudeen,S.:Cyber-attackpredictionbasedonnetworkintrusiondetection systems for alert correlation techniques: A survey. Sensors22(4), 1–15 (2022)

  15. [15]

    Journal of Information Security and Applications p

    Andrade, R.O., Yoo, S.G.: Cognitive security: A comprehensive study of cognitive science in cybersecurity. Journal of Information Security and Applications p. 102352 (2019)

  16. [16]

    Computers & Security p

    Bayer, M., Frey, T., Reuter, C.: Multi-level fine-tuning, data augmentation, and few-shot learning for specialized cyber threat intelligence. Computers & Security p. 103430 (2023)

  17. [17]

    CIN: Computers, Informatics, Nursing38(10), 500–507 (2020)

    Benton,S.E.,Hueckel,R.M.,Taicher,B.,Muckler,V.C.:Usabilityassessmentofanelectronic handoff tool to facilitate and improve postoperative communication between anesthesia and intensive care unit staff. CIN: Computers, Informatics, Nursing38(10), 500–507 (2020)

  18. [18]

    Computers & Security (2024)

    Boffa, M., Drago, I., Mellia, M., Vassio, L., Giordano, D., Valentim, R., Houidi, Z.B.: Logprécis: Unleashing language models for automated malicious log analysis: Précis: A concise summary of essential points, statements, or facts. Computers & Security (2024)

  19. [19]

    In: Proc

    Chen,Y.,Arunasalam,A.,Celik,Z.B.:Canlargelanguagemodelsprovidesecurity&privacy advice? measuring the ability of llms to refute misconceptions. In: Proc. of ACSAC 2023 (2023)

  20. [20]

    of EMNLP 2023

    Cheng, L., Li, X., Bing, L.: Is GPT-4 a good data analyst? In: Proc. of EMNLP 2023. pp. 9496–9514 (2023) LLMs, You Can Evaluate It! 39

  21. [21]

    In: Proc

    Chiang, C.H., Lee, H.y.: A closer look into using large language models for automatic evaluation. In: Proc. of EMNLP 2023. pp. 8928–8942 (2023)

  22. [22]

    In: Proc

    Chu, S.Y., Kim, J.W., Yi, M.Y.: Think together and work better: Combining humans’ and llms’ think-aloud outcomes for effective text evaluation. In: Proc. of CHI 2025. pp. 1–23 (2025)

  23. [23]

    Marg: Multi-agent review generation for scientific papers

    D’Arcy, M., Hope, T., Birnbaum, L., Downey, D.: Marg: Multi-agent review generation for scientific papers. arXiv preprint arXiv:2401.04259 (2024)

  24. [24]

    Du, J., Wang, Y., Zhao, W., Deng, Z., Liu, S., Lou, R., Zou, H.P., Venkit, P.N., Zhang, N., Srinath,M., etal.: Llmsassist nlpresearchers: Critiquepaper (meta-)reviewing. In:Proc. of EMNLP 2024. pp. 5081–5099 (2024)

  25. [25]

    In: Proc

    Fu, J., Ng, S.K., Jiang, Z., Liu, P.: Gptscore: Evaluate as you desire. In: Proc. of NAACL

  26. [26]

    6556–6576 (2024)

    pp. 6556–6576 (2024)

  27. [27]

    In: Proc

    Gadyatskaya, O., Papuc, D.: Chatgpt knows your attacks: Synthesizing attack trees using llms. In: Proc. of DSAI 2023. pp. 245–260 (2023)

  28. [28]

    Sensors (2021)

    González-Granadillo, G., González-Zarzosa, S., Diaz, R.: Security information and event management (siem): Analysis, trends, and usage in critical infrastructures. Sensors (2021)

  29. [29]

    IEEE Access pp

    Gupta,M.,Akiri,C.,Aryal,K.,Parker,E.,Praharaj,L.:Fromchatgpttothreatgpt:Impactof generative AI in cybersecurity and privacy. IEEE Access pp. 80218–80245 (2023)

  30. [30]

    In: Proc

    Gupta,N.,Traore,I.,deQuinan,P.M.F.:Automatedeventprioritizationforsecurityoperation center using deep learning. In: Proc. of Big Data 2019. pp. 5864–5872 (2019)

  31. [31]

    International Journal for Quality in Health Care pp

    Hales, B., Terblanche, M., Fowler, R., Sibbald, W.: Development of medical checklists for improved quality of patient care. International Journal for Quality in Health Care pp. 22–30 (2008)

  32. [32]

    In: Proc

    Hao, Y., He, H., Ho, J.C.: LLMSYN: Generating synthetic electronic health records without patient-level data. In: Proc. of MLHC 2024 (2024)

  33. [33]

    Digital Threats pp

    Happa, J., Agrafiotis, I., Helmhout, M., Bashford-Rogers, T., Goldsmith, M., Creese, S.: Assessing a decision support tool for soc analysts. Digital Threats pp. 1–35 (2021)

  34. [34]

    Hattie,J.,Timperley,H.:Thepoweroffeedback.Reviewofeducationalresearch77,81–112 (2007)

  35. [35]

    ACM Transactions on Information Systems pp

    Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems pp. 1–55 (2025)

  36. [36]

    102609 (2022)

    Husák, M., Sadlek, L., Špaček, S., Laštovička, M., Javorník, M., Komárková, J.: Crusoe: A toolsetforcybersituationalawarenessanddecisionsupportinincidenthandling.Computers & Security p. 102609 (2022)

  37. [37]

    i’mgettinginformationthaticanactonnow

    Jawad,A.,Assal,H.,Jaskolka,J.:"i’mgettinginformationthaticanactonnow":Exploring the level of actionable information in tool-generated threat reports. In: Proc. of EuroUSEC

  38. [38]

    NIST SP pp

    Johnson, C.: Guide to cyber threat information sharing. NIST SP pp. 800–150 (2016)

  39. [39]

    Johnson,C.:Ahandbookofincidentandaccidentreporting.Fail.Safety-CriticalSyst(2003)

  40. [40]

    In: Proc

    Kersten, L., Beelen, K., Zambon, E., Snijders, C., Allodi, L.: A field study to uncover and a tool to support the alert investigation process of tier-1 analysts. In: Proc. of USEC 2025 (2025)

  41. [41]

    In: Proc

    Kersten,L.,Darré,S.,Mulders,T.,Zambon,E.,Caselli,M.,Snijders,C.,Allodi,L.:Asecurity alertinvestigationtoolsupportingtier1analystsincontextualizingandunderstandingnetwork security events. In: Proc. of ACSAC 2024. pp. 890–905. IEEE (2024)

  42. [42]

    In: Proc

    Kersten, L., Mulders, T., Zambon, E., Snijders, C., Allodi, L.: ’give me structure’: Synthesis and evaluation of a (network) threat analysis process supporting tier 1 investigations in a security operation center. In: Proc. of SOUPS 2023. pp. 97–111 (2023) 40 H. Okada et al

  43. [43]

    In: Proc

    Kim,S.,Shin,J.,Cho,Y.,Jang,J.,Longpre,S.,Lee,H.,Yun,S.,Shin,S.,Kim,S.,Thorne,J., et al.: Prometheus: Inducing fine-grained evaluation capability in language models. In: Proc. of ICLR 2023 (2023)

  44. [44]

    Knerler,K.,Parker,I.,Zimmerman,C.:11strategiesofaworld-classcybersecurityoperations center. Tech. Rep. Second Edition, MITRE (2022)

  45. [45]

    In: Proc

    Kokulu,F.B.,Soneji,A.,Bao,T.,Shoshitaishvili,Y.,Zhao,Z.,Doupé,A.,Ahn,G.J.:Matched and mismatched socs: A qualitative study on security operations center issues. In: Proc. of CCS 2019. pp. 1955–1970 (2019)

  46. [46]

    International Journal of Educational Methodology pp

    Kusmaryono, I., Wijayanti, D., Maharani, H.R.: Number of response options, reliability, validity,andpotentialbiasintheuseofthelikertscaleeducationandsocialscienceresearch: A literature review. International Journal of Educational Methodology pp. 625–637 (2022)

  47. [47]

    PLOS Digital Health4(9), e0001015 (2025)

    Kwong, E., Cole, A., Byrd, E., Sippo, D., Yu, F., Adapa, K., Shea, C.M., Moore, C., Das, S., Mazur, L.: Design approaches for developing quality checklists in healthcare organizations: a scoping review. PLOS Digital Health4(9), e0001015 (2025)

  48. [48]

    NEJM AI1(8), AIoa2400196 (2024)

    Liang,W.,Zhang,Y.,Cao,H.,Wang,B.,Ding,D.Y.,Yang,X.,Vodrahalli,K.,He,S.,Smith, D.S., Yin, Y., et al.: Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI1(8), AIoa2400196 (2024)

  49. [49]

    org/CorpusID:263830310

    Liu, R., Shah, N.B.: Reviewergpt? an exploratory study on using large language models for paper reviewing. arXiv preprint arXiv:2306.00622 (2023)

  50. [50]

    In: Proc

    Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-eval: NLG evaluation using gpt-4 with better human alignment. In: Proc. of EMNLP 2023. pp. 2511–2522 (2023)

  51. [51]

    HD - Eval : Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition , February 2024

    Liu,Y.,Yang,T.,Huang,S.,Zhang,Z.,Huang,H.,Wei,F.,Deng,W.,Sun,F.,Zhang,Q.:Hd- eval: Aligning large language model evaluators through hierarchical criteria decomposition. arXiv preprint arXiv:2402.15754 (2024)

  52. [52]

    Computers (2025)

    Loumachi, F.Y., Ghanem, M.C., Ferrag, M.A.: Advancing cyber incident timeline analysis through retrieval-augmented generation and large language models. Computers (2025)

  53. [53]

    301683 (2024)

    Michelet, G., Breitinger, F.: Chatgpt, llama, can you write my report? an experiment on assisteddigitalforensicsreportswrittenusing(local)largelanguagemodels.ForensicScience International: Digital Investigation p. 301683 (2024)

  54. [54]

    arXiv preprint arXiv:2401.10036 (2024)

    Mitra,S.,Neupane,S.,Chakraborty,T.,Mittal,S.,Piplai,A.,Gaur,M.,Rahimi,S.:LOCAL- INTEL:generatingorganizationalthreatintelligencefromglobalandlocalcyberknowledge. arXiv preprint arXiv:2401.10036 (2024)

  55. [55]

    In: Proc

    Murugadoss, B., Poelitz, C., Drosos, I., Le, V., McKenna, N., Negreanu, C.S., Parnin, C., Sarkar, A.: Evaluating the evaluator: Measuring llms’ adherence to task evaluation instruc- tions. In: Proc. of AAAI 2025. pp. 19589–19597 (2025)

  56. [56]

    Proceedings of the ACM on Human-Computer Interaction pp

    Nepal, S., Hernandez, J., Lewis, R., Chaudhry, A., Houck, B., Knudsen, E., Rojas, R., Tankus,B.,Prafullchandra,H.,Czerwinski,M.:Burnoutincybersecurityincidentresponders: Exploring the factors that light the fire. Proceedings of the ACM on Human-Computer Interaction pp. 1–35 (2024)

  57. [57]

    In: Proc

    Oba, T., Taniguchi, T., Yanai, N.: Score and you shall find: A novel regularization technique for cyberattack detection in industrial control systems. In: Proc. of RICSS 2024. pp. 47–56. ACM (2024)

  58. [58]

    Perrina,F.,Marchiori,F.,Conti,M.,Verde,N.V.:AGIR:Automatingcyberthreatintelligence reportingwithnaturallanguagegeneration.In:Proc.ofBigData2023.pp.3053–3062(2023)

  59. [59]

    In: Proc

    Renners, L., Heine, F., Kleiner, C., Rodosek, G.D.: Adaptive and intelligible prioritization for network security incidents. In: Proc. of Cyber Security 2019. pp. 1–8 (2019)

  60. [60]

    Journal of Critical Care72, 154153 (2022) LLMs, You Can Evaluate It! 41

    Rose, L., Istanboulian, L., Amaral, A.C.K.B., Burry, L., Cox, C.E., Cuthbertson, B.H., Iwashyna, T.J., Dale, C.M., Fraser, I.: Co-designed and consensus based development of a quality improvement checklist of patient and family-centered actionable processes of care for adults with persistent critical illness. Journal of Critical Care72, 154153 (2022) LLMs...

  61. [61]

    Computers & Operations Research pp

    Ryan, J.J., Mazzuchi, T.A., Ryan, D.J., Lopez de la Cruz, J., Cooke, R.: Quantifying infor- mation security risks using expert judgment elicitation. Computers & Operations Research pp. 774–784 (2012)

  62. [62]

    arXiv preprint arXiv:2501.18099 (2025)

    Saha, S., Li, X., Ghazvininejad, M., Weston, J., Wang, T.: Learning to plan & reason for evaluation with thinking-llm-as-a-judge. arXiv preprint arXiv:2501.18099 (2025)

  63. [63]

    Scarfone,K.A.,Grance,T.,Masone,K.:Sp800-61rev.1.computersecurityincidenthandling guide (2008)

  64. [64]

    IEEE Access10, 27881–27898 (2022)

    Shahjee,D.,Ware,N.:Integratednetworkandsecurityoperationcenter:Asystematicanalysis. IEEE Access10, 27881–27898 (2022)

  65. [65]

    arXiv preprint arXiv:2407.15428 (2024)

    Sharma, R., Okada, H., Oba, T., Subramanian, K., Yanai, N., Pranata, S.: Decoding bac- net packets: A large language model approach for packet interpretation. arXiv preprint arXiv:2407.15428 (2024)

  66. [66]

    In: Proc

    Shibli,A.M.,Pritom,M.M.A.,Gupta,M.:Abusegpt:Abuseofgenerativeaichatbotstocreate smishing campaigns. In: Proc. of ISDFS 2024. pp. 1–6 (2024)

  67. [67]

    arXiv preprint arXiv:2506.09365 (2025)

    Singh, R., Chhetri, M.B., Nepal, S., Paris, C.: Contextbuddy: Ai-enhanced contextual insights for security alert investigation (applied to intrusion detection). arXiv preprint arXiv:2506.09365 (2025)

  68. [68]

    Kalamazoo, MI: Eval Cent16, 2008 (2000)

    Stufflebeam, D.L.: Guidelines for developing evaluation checklists: the checklists develop- ment checklist (cdc). Kalamazoo, MI: Eval Cent16, 2008 (2000)

  69. [69]

    Journal of Cybersecurity and Privacy pp

    Tilbury, J., Flowerday, S.: Humans and automation: Augmenting security operation centers. Journal of Cybersecurity and Privacy pp. 388–409 (2024)

  70. [70]

    Automated alert classification and triage (AACT): An intelligent system for the prioritisation of cybersecurity alerts,

    Turcotte, M., Labrèche, F., Paquette, S.O.: Automated alert classification and triage (aact): An intelligent system for the prioritisation of cybersecurity alerts. arXiv preprint arXiv:2505.09843 (2025)

  71. [71]

    Computers & Security p

    van der Kleij, R., Schraagen, J.M., Cadet, B., Young, H.: Developing decision support for cybersecurity threat and incident managers. Computers & Security p. 102535 (2022)

  72. [72]

    Wharton, C., Kintsch, W.: An overview of construction-integration model: a theory of com- prehensionasafoundationforanewcognitivearchitecture.ACMSigartBulletinpp.169–173 (1991)

  73. [73]

    In: Proc

    Wickramasekara, A., Scanlon, M.: A framework for integrated digital forensic investigation employing autogen ai agents. In: Proc. of ISDFS 2024. pp. 01–06 (2024)

  74. [74]

    In: Proc

    Yen,T.F.,Oprea,A.,Onarlioglu,K.,Leetham,T.,Robertson,W.,Juels,A.,Kirda,E.:Beehive: large-scale log analysis for detecting suspicious activity in enterprise networks. In: Proc. of ACSAC 2013. pp. 199–208 (2013)

  75. [75]

    In: Proc

    Zhong, C., Yen, J., Liu, P., Erbacher, R.F.: Automate cybersecurity data triage by leveraging human analysts’ cognitive process. In: Proc. of BigDataSecurity/HPSC/IDS 2016 (2016)

  76. [76]

    IEEE Systems Journal pp

    Zhong, C., Yen, J., Liu, P., Erbacher, R.F.: Learning from experts’ experience: Toward automated cyber security data triage. IEEE Systems Journal pp. 603–614 (2019)

  77. [77]

    In: Proc

    Zhou, R., Chen, L., Yu, K.: Is llm a reliable reviewer? a comprehensive evaluation of llm on automatic paper reviewing tasks. In: Proc. of LREC-COLING 2024. pp. 9340–9351 (2024)

  78. [78]

    35045 (2025)

    Zubaer, A.A., Granitzer, M., Geschwind, S., Graf Lambsdorff, J., Voss, D.: Gpt-4 shows comparableperformancetohumanexaminersinrankingopen-textanswers.ScientificReports p. 35045 (2025)