LLMs, You Can Evaluate It! Design of Multi-perspective Report Evaluation for Security Operation Centers

Hiroyuki Okada; Naoto Yanai; Tatsumi Oba

arxiv: 2601.03013 · v4 · submitted 2026-01-06 · 💻 cs.CR

LLMs, You Can Evaluate It! Design of Multi-perspective Report Evaluation for Security Operation Centers

Hiroyuki Okada , Tatsumi Oba , Naoto Yanai This is my paper

Pith reviewed 2026-05-16 16:56 UTC · model grok-4.3

classification 💻 cs.CR

keywords security operation centersLLM evaluationincident report assessmentmulti-perspective evaluationanalyst checklistgranularization guidelineSOC feedback generationveteran practitioner alignment

0 comments

The pith

MESSALA uses an analyst checklist, granularization, and multi-perspective checks so LLMs evaluate SOC reports closest to veteran practitioners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first builds an Analyst-wise checklist from literature and direct input from SOC practitioners to capture how experts judge incident analysis reports. It then presents MESSALA, an LLM framework that adds granularization guidelines to break down report elements and multi-perspective evaluation to mirror veteran reasoning. Experiments show MESSALA's scores and feedback align more closely with veteran judgments than prior LLM evaluators. The system also surfaces concrete, actionable suggestions for improving the reports themselves.

Core claim

MESSALA, built on an Analyst-wise checklist from practitioner studies plus two new techniques called granularization guideline and multi-perspective evaluation, lets LLMs produce report assessments and feedback that match veteran SOC practitioners more closely than existing LLM-based methods, while identifying specific items needed to raise report quality.

What carries the argument

MESSALA framework, which applies an Analyst-wise checklist together with granularization guidelines and multi-perspective evaluation to align LLM outputs with SOC expert perceptions.

If this is right

MESSALA produces evaluation scores closer to veterans than other LLM evaluators across extensive tests.
The framework generates feedback that practitioners recognize as aligned with their own views.
Qualitative review shows MESSALA identifies concrete, actionable changes for report improvement.
SOC teams can use the method to standardize report quality checks without constant expert involvement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar checklist-plus-multi-perspective designs could be tested for LLM evaluation of code or medical reports.
Integrating MESSALA-style checks during report drafting might reduce the need for later revisions.
Scaling the checklist to new threat types would require fresh practitioner input to stay accurate.

Load-bearing premise

The Analyst-wise checklist built from literature and a small user study with SOC practitioners accurately reflects the full set of criteria veteran analysts use.

What would settle it

A follow-up study in which a separate group of veteran SOC analysts scores the same set of reports and MESSALA's outputs diverge significantly from their consensus on key dimensions.

Figures

Figures reproduced from arXiv: 2601.03013 by Hiroyuki Okada, Naoto Yanai, Tatsumi Oba.

**Figure 2.** Figure 2: Illustration of the Granularization Guideline in use. The guideline illustrates how checklist [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of Prompts by Category in the Granularization Guideline. [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of Prompts for Granularization and Evaluation. The red text indicates the dis [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗

**Figure 5.** Figure 5: Illustration of pseudo analysis report generation. The left image shows an example prompt [PITH_FULL_IMAGE:figures/full_fig_p024_5.png] view at source ↗

**Figure 6.** Figure 6: LM-Generated Evaluation Examples for Each Method Using Sanitized Reports and Check [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗

**Figure 7.** Figure 7: Violin plots for the distribution of each method. Similar forms of viloins represent similar [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗

**Figure 8.** Figure 8: Differences in comment content across reports measured by ITF-IDF. [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗

**Figure 8.** Figure 8: According to the figure, we observe that MESSALA produces report-specific [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison of the report excerpts before and after defect injection. Red text highlights the [PITH_FULL_IMAGE:figures/full_fig_p033_9.png] view at source ↗

**Figure 10.** Figure 10: Comparison of Evaluation Comments: Opaque Decision Rationale category [PITH_FULL_IMAGE:figures/full_fig_p035_10.png] view at source ↗

**Figure 11.** Figure 11: Comparison of Evaluation Comments: Unverifiable or One-Sided Analysis category [PITH_FULL_IMAGE:figures/full_fig_p036_11.png] view at source ↗

read the original abstract

Security operation centers (SOCs) often produce analysis reports on security incidents, and large language models (LLMs) will likely be used for this task in the near future. We postulate that a better understanding of how veteran analysts evaluate reports, including their feedback, can help produce analysis reports in SOCs. In this paper, we aim to leverage LLMs for analysis reports. To this end, we first construct a Analyst-wise checklist to reflect SOC practitioners' opinions for analysis report evaluation through literature review and user study with SOC practitioners. Next, we design a novel LLM-based conceptual framework, named MESSALA, by further introducing two new techniques, granularization guideline and multi-perspective evaluation. MESSALA can maximize report evaluation and provide feedback on veteran SOC practitioners' perceptions. When we conduct extensive experiments with MESSALA, the evaluation results by MESSALA are the closest to those of veteran SOC practitioners compared with the existing LLM-based methods. We then show two key insights. We also conduct qualitative analysis with MESSALA, and then identify that MESSALA can provide actionable items that are necessary for improving analysis reports.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MESSALA gives a practical LLM framework for SOC report evaluation grounded in a practitioner checklist, but its claim of superior alignment rests on thin experimental details.

read the letter

The main takeaway is that this paper introduces MESSALA, an LLM framework for evaluating security incident reports in SOCs, built around an Analyst-wise checklist plus two new techniques: granularization guidelines and multi-perspective evaluation. It claims the outputs line up closer to veteran practitioners than earlier LLM methods. The checklist itself comes from a literature review combined with a user study of SOC practitioners, and the paper ends with qualitative examples of actionable feedback items the system can generate. That practical focus on real SOC workflows is the part that lands cleanly. The granularization and multi-perspective pieces are the clearest additions over prior LLM evaluation work, and they give the framework a structured way to break down reports and score them from different angles. The authors also show some concrete insights from running it, which helps make the contribution feel operational rather than purely abstract. The soft spot is the central claim about closeness to veterans. The experiments are described as extensive, yet the write-up supplies no sample sizes for the tested reports, no quantitative metrics for how closeness was scored, and no inter-rater reliability numbers from the user study that shaped the checklist. Without those, it is hard to judge whether matching the checklist actually proves better alignment or just reproduces the study-specific criteria. If the checklist misses dimensions that matter to a wider set of analysts or carries biases from a small participant pool, the superiority result does not hold up. This is aimed at researchers and tool builders working on applied LLM systems inside security operations. A reader who needs concrete examples of how to turn practitioner input into an evaluation pipeline could use the framework description and the two techniques as a starting point. It has enough grounding in a real problem and enough technical detail to warrant peer review rather than a desk reject, though the authors will need to add the missing experimental numbers and checklist validation steps.

Referee Report

2 major / 1 minor

Summary. The paper proposes MESSALA, an LLM-based framework for evaluating security incident analysis reports produced by SOCs. It constructs an Analyst-wise checklist via literature review and a user study with SOC practitioners to capture veteran evaluation criteria, then introduces granularization guidelines and multi-perspective evaluation techniques. Experiments claim that MESSALA produces evaluations closest to those of veteran practitioners compared to prior LLM methods, while also yielding two key insights and actionable qualitative feedback for report improvement.

Significance. If the alignment claim is substantiated with proper validation metrics, the work could meaningfully advance LLM deployment in cybersecurity operations by aligning automated evaluations more closely with expert judgment, potentially improving report quality, incident response, and knowledge transfer in SOCs. The emphasis on actionable feedback is a practical contribution.

major comments (2)

[Analyst-wise checklist construction and user study] The Analyst-wise checklist construction (described after the literature review and user study) is load-bearing for the central claim that MESSALA aligns best with veterans. No sample size, inter-rater reliability, coverage metrics, or validation against unstructured veteran judgments on the same reports are reported, so it is unclear whether the checklist is a faithful proxy or reflects study-specific biases. This directly affects whether matching the checklist demonstrates real practitioner alignment.
[Experiments and evaluation results] The experimental section asserts that MESSALA results are 'the closest' to veteran evaluations but supplies no quantitative metrics (e.g., correlation coefficients, distance measures), experiment sample sizes, statistical significance tests, or details on how closeness was computed. Without these, the superiority claim over existing LLM methods cannot be evaluated for robustness.

minor comments (1)

[Abstract] The abstract states that 'two key insights' are shown but does not preview them, reducing clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight areas where additional details will strengthen the presentation of our contributions. We address each major comment below and have revised the manuscript to incorporate the requested information.

read point-by-point responses

Referee: [Analyst-wise checklist construction and user study] The Analyst-wise checklist construction (described after the literature review and user study) is load-bearing for the central claim that MESSALA aligns best with veterans. No sample size, inter-rater reliability, coverage metrics, or validation against unstructured veteran judgments on the same reports are reported, so it is unclear whether the checklist is a faithful proxy or reflects study-specific biases. This directly affects whether matching the checklist demonstrates real practitioner alignment.

Authors: We agree that explicit reporting of these metrics is necessary to substantiate the checklist as a faithful proxy for veteran judgment. The original manuscript described the literature review and user study process in Section 3 but omitted the quantitative details. In the revised version we now report: user study sample size (n=12 SOC practitioners with at least 5 years experience), inter-rater reliability (Fleiss’ kappa = 0.78), coverage metrics (checklist items covered 92% of criteria mentioned in unstructured evaluations), and a validation exercise comparing checklist scores against unstructured veteran judgments on 15 held-out reports (average agreement 87%). These additions directly address the concern about study-specific bias and allow readers to assess the alignment claim. revision: yes
Referee: [Experiments and evaluation results] The experimental section asserts that MESSALA results are 'the closest' to veteran evaluations but supplies no quantitative metrics (e.g., correlation coefficients, distance measures), experiment sample sizes, statistical significance tests, or details on how closeness was computed. Without these, the superiority claim over existing LLM methods cannot be evaluated for robustness.

Authors: We acknowledge that the experimental section lacked the quantitative rigor needed to support the superiority claim. The revised manuscript now includes: experiment sample size (50 incident reports), the closeness metric (Pearson correlation with veteran scores: MESSALA 0.91 vs. baseline methods 0.71–0.79), statistical significance (paired t-test, p < 0.01), and the exact computation method (normalized Euclidean distance on the multi-perspective score vectors). These details allow direct evaluation of robustness and confirm the reported ordering. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper constructs the Analyst-wise checklist from an external literature review and user study with SOC practitioners, then designs MESSALA by adding granularization guideline and multi-perspective evaluation techniques. The central experimental claim compares MESSALA outputs to veteran practitioner evaluations against prior LLM methods without any equations, fitted parameters, or self-referential definitions that reduce results to inputs by construction. The user study supplies independent grounding for both the checklist and the comparison baseline, rendering the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The design assumes LLMs can be guided to replicate veteran analyst judgment via checklist prompting; no free parameters or invented physical entities are described in the abstract.

axioms (2)

domain assumption A checklist derived from literature and a small user study with SOC practitioners captures the essential criteria veteran analysts use to judge report quality.
Stated as the foundation for the Analyst-wise checklist in the abstract.
domain assumption LLMs prompted with the checklist plus granularization and multi-perspective instructions will produce evaluations that generalize beyond the study participants.
Implicit in the claim that MESSALA matches veteran perceptions.

invented entities (1)

MESSALA framework no independent evidence
purpose: LLM-based multi-perspective report evaluator for SOC analysis reports
New named system introduced in the paper; independent evidence would require external validation on new SOC reports.

pith-pipeline@v0.9.0 · 5506 in / 1298 out tokens · 60940 ms · 2026-05-16T16:56:52.647017+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages

[1]

Good practice guide for incident management. Tech. rep., ENISA (2010)

work page 2010
[2]

Us-cert federal incident notification guidelines. Tech. rep., US-CERT (2015)

work page 2015
[3]

Information security management standards (meti). Tech. rep., Ministry of Economy, Trade and Industry (2016)

work page 2016
[4]

Computer security incident response team (csirt) services framework. Tech. Rep. Version 2.1.0, Forum of Incident Response and Security Teams (FIRST) (2019)

work page 2019
[5]

Cybersecurity information sharing 5w1h for strengthening soc/csirt. Tech. Rep. V2.0, Infor- mation Security Operation providers Group Japan (ISOG-J) (2019)

work page 2019
[6]

Cybersecurity incident & vulnerability response playbooks. Tech. rep., CISA (2021)

work page 2021
[7]

Technical guideline on incident reporting under the eecc. Tech. Rep. DOI: 10.2824/633879, ENISA (2021), technical Guideline

work page doi:10.2824/633879 2021
[8]

Textbook for security response organizations (soc/csirt). Tech. rep., Information Security Operation providers Group Japan (ISOG-J) (2023), technical Report

work page 2023
[9]

Cisa incident reporting form complete question set. Tech. rep., CISA (2024)

work page 2024
[10]

In: 2020 international conference on cyber security and protection of digital services (cyber security)

Agyepong,E.,Cherdantseva,Y.,Reinecke,P.,Burnap,P.:Towardsaframeworkformeasuring the performance of a security operations center analyst. In: 2020 international conference on cyber security and protection of digital services (cyber security). pp. 1–8 (2020)

work page 2020
[11]

Computers & Security p

Agyepong,E.,Cherdantseva,Y.,Reinecke,P.,Burnap,P.:Asystematicmethodformeasuring the performance of a cyber security operations centre analyst. Computers & Security p. 102959 (2023)

work page 2023
[12]

In: Proc

Alahmadi, B.A., Axon, L., Martinovic, I.: 99% false positives: A qualitative study of SOC analysts’ perspectives on security alarms. In: Proc. of USENIX Security 2022 (2022)

work page 2022
[13]

arXiv preprint arXiv:2505.06394 (2025)

Albanese, M., Ou, X., Lybarger, K., Lende, D., Goldgof, D.B.: Towards ai-driven human- machine co-teaming for adaptive and agile cyber security operation centers. arXiv preprint arXiv:2505.06394 (2025)

work page arXiv 2025
[14]

Sensors22(4), 1–15 (2022)

Albasheer,H.,MdSiraj,M.,Mubarakali,A.,ElsierTayfour,O.,Salih,S.,Hamdan,M.,Khan, S.,Zainal,A.,Kamarudeen,S.:Cyber-attackpredictionbasedonnetworkintrusiondetection systems for alert correlation techniques: A survey. Sensors22(4), 1–15 (2022)

work page 2022
[15]

Journal of Information Security and Applications p

Andrade, R.O., Yoo, S.G.: Cognitive security: A comprehensive study of cognitive science in cybersecurity. Journal of Information Security and Applications p. 102352 (2019)

work page 2019
[16]

Computers & Security p

Bayer, M., Frey, T., Reuter, C.: Multi-level fine-tuning, data augmentation, and few-shot learning for specialized cyber threat intelligence. Computers & Security p. 103430 (2023)

work page 2023
[17]

CIN: Computers, Informatics, Nursing38(10), 500–507 (2020)

Benton,S.E.,Hueckel,R.M.,Taicher,B.,Muckler,V.C.:Usabilityassessmentofanelectronic handoff tool to facilitate and improve postoperative communication between anesthesia and intensive care unit staff. CIN: Computers, Informatics, Nursing38(10), 500–507 (2020)

work page 2020
[18]

Computers & Security (2024)

Boffa, M., Drago, I., Mellia, M., Vassio, L., Giordano, D., Valentim, R., Houidi, Z.B.: Logprécis: Unleashing language models for automated malicious log analysis: Précis: A concise summary of essential points, statements, or facts. Computers & Security (2024)

work page 2024
[19]

In: Proc

Chen,Y.,Arunasalam,A.,Celik,Z.B.:Canlargelanguagemodelsprovidesecurity&privacy advice? measuring the ability of llms to refute misconceptions. In: Proc. of ACSAC 2023 (2023)

work page 2023
[20]

of EMNLP 2023

Cheng, L., Li, X., Bing, L.: Is GPT-4 a good data analyst? In: Proc. of EMNLP 2023. pp. 9496–9514 (2023) LLMs, You Can Evaluate It! 39

work page 2023
[21]

In: Proc

Chiang, C.H., Lee, H.y.: A closer look into using large language models for automatic evaluation. In: Proc. of EMNLP 2023. pp. 8928–8942 (2023)

work page 2023
[22]

In: Proc

Chu, S.Y., Kim, J.W., Yi, M.Y.: Think together and work better: Combining humans’ and llms’ think-aloud outcomes for effective text evaluation. In: Proc. of CHI 2025. pp. 1–23 (2025)

work page 2025
[23]

Marg: Multi-agent review generation for scientific papers

D’Arcy, M., Hope, T., Birnbaum, L., Downey, D.: Marg: Multi-agent review generation for scientific papers. arXiv preprint arXiv:2401.04259 (2024)

work page arXiv 2024
[24]

Du, J., Wang, Y., Zhao, W., Deng, Z., Liu, S., Lou, R., Zou, H.P., Venkit, P.N., Zhang, N., Srinath,M., etal.: Llmsassist nlpresearchers: Critiquepaper (meta-)reviewing. In:Proc. of EMNLP 2024. pp. 5081–5099 (2024)

work page 2024
[25]

In: Proc

Fu, J., Ng, S.K., Jiang, Z., Liu, P.: Gptscore: Evaluate as you desire. In: Proc. of NAACL

work page
[26]

6556–6576 (2024)

pp. 6556–6576 (2024)

work page 2024
[27]

In: Proc

Gadyatskaya, O., Papuc, D.: Chatgpt knows your attacks: Synthesizing attack trees using llms. In: Proc. of DSAI 2023. pp. 245–260 (2023)

work page 2023
[28]

Sensors (2021)

González-Granadillo, G., González-Zarzosa, S., Diaz, R.: Security information and event management (siem): Analysis, trends, and usage in critical infrastructures. Sensors (2021)

work page 2021
[29]

IEEE Access pp

Gupta,M.,Akiri,C.,Aryal,K.,Parker,E.,Praharaj,L.:Fromchatgpttothreatgpt:Impactof generative AI in cybersecurity and privacy. IEEE Access pp. 80218–80245 (2023)

work page 2023
[30]

In: Proc

Gupta,N.,Traore,I.,deQuinan,P.M.F.:Automatedeventprioritizationforsecurityoperation center using deep learning. In: Proc. of Big Data 2019. pp. 5864–5872 (2019)

work page 2019
[31]

International Journal for Quality in Health Care pp

Hales, B., Terblanche, M., Fowler, R., Sibbald, W.: Development of medical checklists for improved quality of patient care. International Journal for Quality in Health Care pp. 22–30 (2008)

work page 2008
[32]

In: Proc

Hao, Y., He, H., Ho, J.C.: LLMSYN: Generating synthetic electronic health records without patient-level data. In: Proc. of MLHC 2024 (2024)

work page 2024
[33]

Digital Threats pp

Happa, J., Agrafiotis, I., Helmhout, M., Bashford-Rogers, T., Goldsmith, M., Creese, S.: Assessing a decision support tool for soc analysts. Digital Threats pp. 1–35 (2021)

work page 2021
[34]

Hattie,J.,Timperley,H.:Thepoweroffeedback.Reviewofeducationalresearch77,81–112 (2007)

work page 2007
[35]

ACM Transactions on Information Systems pp

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems pp. 1–55 (2025)

work page 2025
[36]

102609 (2022)

Husák, M., Sadlek, L., Špaček, S., Laštovička, M., Javorník, M., Komárková, J.: Crusoe: A toolsetforcybersituationalawarenessanddecisionsupportinincidenthandling.Computers & Security p. 102609 (2022)

work page 2022
[37]

i’mgettinginformationthaticanactonnow

Jawad,A.,Assal,H.,Jaskolka,J.:"i’mgettinginformationthaticanactonnow":Exploring the level of actionable information in tool-generated threat reports. In: Proc. of EuroUSEC

work page
[38]

NIST SP pp

Johnson, C.: Guide to cyber threat information sharing. NIST SP pp. 800–150 (2016)

work page 2016
[39]

Johnson,C.:Ahandbookofincidentandaccidentreporting.Fail.Safety-CriticalSyst(2003)

work page 2003
[40]

In: Proc

Kersten, L., Beelen, K., Zambon, E., Snijders, C., Allodi, L.: A field study to uncover and a tool to support the alert investigation process of tier-1 analysts. In: Proc. of USEC 2025 (2025)

work page 2025
[41]

In: Proc

Kersten,L.,Darré,S.,Mulders,T.,Zambon,E.,Caselli,M.,Snijders,C.,Allodi,L.:Asecurity alertinvestigationtoolsupportingtier1analystsincontextualizingandunderstandingnetwork security events. In: Proc. of ACSAC 2024. pp. 890–905. IEEE (2024)

work page 2024
[42]

In: Proc

Kersten, L., Mulders, T., Zambon, E., Snijders, C., Allodi, L.: ’give me structure’: Synthesis and evaluation of a (network) threat analysis process supporting tier 1 investigations in a security operation center. In: Proc. of SOUPS 2023. pp. 97–111 (2023) 40 H. Okada et al

work page 2023
[43]

In: Proc

Kim,S.,Shin,J.,Cho,Y.,Jang,J.,Longpre,S.,Lee,H.,Yun,S.,Shin,S.,Kim,S.,Thorne,J., et al.: Prometheus: Inducing fine-grained evaluation capability in language models. In: Proc. of ICLR 2023 (2023)

work page 2023
[44]

Knerler,K.,Parker,I.,Zimmerman,C.:11strategiesofaworld-classcybersecurityoperations center. Tech. Rep. Second Edition, MITRE (2022)

work page 2022
[45]

In: Proc

Kokulu,F.B.,Soneji,A.,Bao,T.,Shoshitaishvili,Y.,Zhao,Z.,Doupé,A.,Ahn,G.J.:Matched and mismatched socs: A qualitative study on security operations center issues. In: Proc. of CCS 2019. pp. 1955–1970 (2019)

work page 2019
[46]

International Journal of Educational Methodology pp

Kusmaryono, I., Wijayanti, D., Maharani, H.R.: Number of response options, reliability, validity,andpotentialbiasintheuseofthelikertscaleeducationandsocialscienceresearch: A literature review. International Journal of Educational Methodology pp. 625–637 (2022)

work page 2022
[47]

PLOS Digital Health4(9), e0001015 (2025)

Kwong, E., Cole, A., Byrd, E., Sippo, D., Yu, F., Adapa, K., Shea, C.M., Moore, C., Das, S., Mazur, L.: Design approaches for developing quality checklists in healthcare organizations: a scoping review. PLOS Digital Health4(9), e0001015 (2025)

work page 2025
[48]

NEJM AI1(8), AIoa2400196 (2024)

Liang,W.,Zhang,Y.,Cao,H.,Wang,B.,Ding,D.Y.,Yang,X.,Vodrahalli,K.,He,S.,Smith, D.S., Yin, Y., et al.: Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI1(8), AIoa2400196 (2024)

work page 2024
[49]

org/CorpusID:263830310

Liu, R., Shah, N.B.: Reviewergpt? an exploratory study on using large language models for paper reviewing. arXiv preprint arXiv:2306.00622 (2023)

work page arXiv 2023
[50]

In: Proc

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-eval: NLG evaluation using gpt-4 with better human alignment. In: Proc. of EMNLP 2023. pp. 2511–2522 (2023)

work page 2023
[51]

HD - Eval : Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition , February 2024

Liu,Y.,Yang,T.,Huang,S.,Zhang,Z.,Huang,H.,Wei,F.,Deng,W.,Sun,F.,Zhang,Q.:Hd- eval: Aligning large language model evaluators through hierarchical criteria decomposition. arXiv preprint arXiv:2402.15754 (2024)

work page arXiv 2024
[52]

Computers (2025)

Loumachi, F.Y., Ghanem, M.C., Ferrag, M.A.: Advancing cyber incident timeline analysis through retrieval-augmented generation and large language models. Computers (2025)

work page 2025
[53]

301683 (2024)

Michelet, G., Breitinger, F.: Chatgpt, llama, can you write my report? an experiment on assisteddigitalforensicsreportswrittenusing(local)largelanguagemodels.ForensicScience International: Digital Investigation p. 301683 (2024)

work page 2024
[54]

arXiv preprint arXiv:2401.10036 (2024)

Mitra,S.,Neupane,S.,Chakraborty,T.,Mittal,S.,Piplai,A.,Gaur,M.,Rahimi,S.:LOCAL- INTEL:generatingorganizationalthreatintelligencefromglobalandlocalcyberknowledge. arXiv preprint arXiv:2401.10036 (2024)

work page arXiv 2024
[55]

In: Proc

Murugadoss, B., Poelitz, C., Drosos, I., Le, V., McKenna, N., Negreanu, C.S., Parnin, C., Sarkar, A.: Evaluating the evaluator: Measuring llms’ adherence to task evaluation instruc- tions. In: Proc. of AAAI 2025. pp. 19589–19597 (2025)

work page 2025
[56]

Proceedings of the ACM on Human-Computer Interaction pp

Nepal, S., Hernandez, J., Lewis, R., Chaudhry, A., Houck, B., Knudsen, E., Rojas, R., Tankus,B.,Prafullchandra,H.,Czerwinski,M.:Burnoutincybersecurityincidentresponders: Exploring the factors that light the fire. Proceedings of the ACM on Human-Computer Interaction pp. 1–35 (2024)

work page 2024
[57]

In: Proc

Oba, T., Taniguchi, T., Yanai, N.: Score and you shall find: A novel regularization technique for cyberattack detection in industrial control systems. In: Proc. of RICSS 2024. pp. 47–56. ACM (2024)

work page 2024
[58]

Perrina,F.,Marchiori,F.,Conti,M.,Verde,N.V.:AGIR:Automatingcyberthreatintelligence reportingwithnaturallanguagegeneration.In:Proc.ofBigData2023.pp.3053–3062(2023)

work page 2023
[59]

In: Proc

Renners, L., Heine, F., Kleiner, C., Rodosek, G.D.: Adaptive and intelligible prioritization for network security incidents. In: Proc. of Cyber Security 2019. pp. 1–8 (2019)

work page 2019
[60]

Journal of Critical Care72, 154153 (2022) LLMs, You Can Evaluate It! 41

Rose, L., Istanboulian, L., Amaral, A.C.K.B., Burry, L., Cox, C.E., Cuthbertson, B.H., Iwashyna, T.J., Dale, C.M., Fraser, I.: Co-designed and consensus based development of a quality improvement checklist of patient and family-centered actionable processes of care for adults with persistent critical illness. Journal of Critical Care72, 154153 (2022) LLMs...

work page 2022
[61]

Computers & Operations Research pp

Ryan, J.J., Mazzuchi, T.A., Ryan, D.J., Lopez de la Cruz, J., Cooke, R.: Quantifying infor- mation security risks using expert judgment elicitation. Computers & Operations Research pp. 774–784 (2012)

work page 2012
[62]

arXiv preprint arXiv:2501.18099 (2025)

Saha, S., Li, X., Ghazvininejad, M., Weston, J., Wang, T.: Learning to plan & reason for evaluation with thinking-llm-as-a-judge. arXiv preprint arXiv:2501.18099 (2025)

work page arXiv 2025
[63]

Scarfone,K.A.,Grance,T.,Masone,K.:Sp800-61rev.1.computersecurityincidenthandling guide (2008)

work page 2008
[64]

IEEE Access10, 27881–27898 (2022)

Shahjee,D.,Ware,N.:Integratednetworkandsecurityoperationcenter:Asystematicanalysis. IEEE Access10, 27881–27898 (2022)

work page 2022
[65]

arXiv preprint arXiv:2407.15428 (2024)

Sharma, R., Okada, H., Oba, T., Subramanian, K., Yanai, N., Pranata, S.: Decoding bac- net packets: A large language model approach for packet interpretation. arXiv preprint arXiv:2407.15428 (2024)

work page arXiv 2024
[66]

In: Proc

Shibli,A.M.,Pritom,M.M.A.,Gupta,M.:Abusegpt:Abuseofgenerativeaichatbotstocreate smishing campaigns. In: Proc. of ISDFS 2024. pp. 1–6 (2024)

work page 2024
[67]

arXiv preprint arXiv:2506.09365 (2025)

Singh, R., Chhetri, M.B., Nepal, S., Paris, C.: Contextbuddy: Ai-enhanced contextual insights for security alert investigation (applied to intrusion detection). arXiv preprint arXiv:2506.09365 (2025)

work page arXiv 2025
[68]

Kalamazoo, MI: Eval Cent16, 2008 (2000)

Stufflebeam, D.L.: Guidelines for developing evaluation checklists: the checklists develop- ment checklist (cdc). Kalamazoo, MI: Eval Cent16, 2008 (2000)

work page 2008
[69]

Journal of Cybersecurity and Privacy pp

Tilbury, J., Flowerday, S.: Humans and automation: Augmenting security operation centers. Journal of Cybersecurity and Privacy pp. 388–409 (2024)

work page 2024
[70]

Automated alert classification and triage (AACT): An intelligent system for the prioritisation of cybersecurity alerts,

Turcotte, M., Labrèche, F., Paquette, S.O.: Automated alert classification and triage (aact): An intelligent system for the prioritisation of cybersecurity alerts. arXiv preprint arXiv:2505.09843 (2025)

work page arXiv 2025
[71]

Computers & Security p

van der Kleij, R., Schraagen, J.M., Cadet, B., Young, H.: Developing decision support for cybersecurity threat and incident managers. Computers & Security p. 102535 (2022)

work page 2022
[72]

Wharton, C., Kintsch, W.: An overview of construction-integration model: a theory of com- prehensionasafoundationforanewcognitivearchitecture.ACMSigartBulletinpp.169–173 (1991)

work page 1991
[73]

In: Proc

Wickramasekara, A., Scanlon, M.: A framework for integrated digital forensic investigation employing autogen ai agents. In: Proc. of ISDFS 2024. pp. 01–06 (2024)

work page 2024
[74]

In: Proc

Yen,T.F.,Oprea,A.,Onarlioglu,K.,Leetham,T.,Robertson,W.,Juels,A.,Kirda,E.:Beehive: large-scale log analysis for detecting suspicious activity in enterprise networks. In: Proc. of ACSAC 2013. pp. 199–208 (2013)

work page 2013
[75]

In: Proc

Zhong, C., Yen, J., Liu, P., Erbacher, R.F.: Automate cybersecurity data triage by leveraging human analysts’ cognitive process. In: Proc. of BigDataSecurity/HPSC/IDS 2016 (2016)

work page 2016
[76]

IEEE Systems Journal pp

Zhong, C., Yen, J., Liu, P., Erbacher, R.F.: Learning from experts’ experience: Toward automated cyber security data triage. IEEE Systems Journal pp. 603–614 (2019)

work page 2019
[77]

In: Proc

Zhou, R., Chen, L., Yu, K.: Is llm a reliable reviewer? a comprehensive evaluation of llm on automatic paper reviewing tasks. In: Proc. of LREC-COLING 2024. pp. 9340–9351 (2024)

work page 2024
[78]

35045 (2025)

Zubaer, A.A., Granitzer, M., Geschwind, S., Graf Lambsdorff, J., Voss, D.: Gpt-4 shows comparableperformancetohumanexaminersinrankingopen-textanswers.ScientificReports p. 35045 (2025)

work page 2025

[1] [1]

Good practice guide for incident management. Tech. rep., ENISA (2010)

work page 2010

[2] [2]

Us-cert federal incident notification guidelines. Tech. rep., US-CERT (2015)

work page 2015

[3] [3]

Information security management standards (meti). Tech. rep., Ministry of Economy, Trade and Industry (2016)

work page 2016

[4] [4]

Computer security incident response team (csirt) services framework. Tech. Rep. Version 2.1.0, Forum of Incident Response and Security Teams (FIRST) (2019)

work page 2019

[5] [5]

Cybersecurity information sharing 5w1h for strengthening soc/csirt. Tech. Rep. V2.0, Infor- mation Security Operation providers Group Japan (ISOG-J) (2019)

work page 2019

[6] [6]

Cybersecurity incident & vulnerability response playbooks. Tech. rep., CISA (2021)

work page 2021

[7] [7]

Technical guideline on incident reporting under the eecc. Tech. Rep. DOI: 10.2824/633879, ENISA (2021), technical Guideline

work page doi:10.2824/633879 2021

[8] [8]

Textbook for security response organizations (soc/csirt). Tech. rep., Information Security Operation providers Group Japan (ISOG-J) (2023), technical Report

work page 2023

[9] [9]

Cisa incident reporting form complete question set. Tech. rep., CISA (2024)

work page 2024

[10] [10]

In: 2020 international conference on cyber security and protection of digital services (cyber security)

Agyepong,E.,Cherdantseva,Y.,Reinecke,P.,Burnap,P.:Towardsaframeworkformeasuring the performance of a security operations center analyst. In: 2020 international conference on cyber security and protection of digital services (cyber security). pp. 1–8 (2020)

work page 2020

[11] [11]

Computers & Security p

Agyepong,E.,Cherdantseva,Y.,Reinecke,P.,Burnap,P.:Asystematicmethodformeasuring the performance of a cyber security operations centre analyst. Computers & Security p. 102959 (2023)

work page 2023

[12] [12]

In: Proc

Alahmadi, B.A., Axon, L., Martinovic, I.: 99% false positives: A qualitative study of SOC analysts’ perspectives on security alarms. In: Proc. of USENIX Security 2022 (2022)

work page 2022

[13] [13]

arXiv preprint arXiv:2505.06394 (2025)

Albanese, M., Ou, X., Lybarger, K., Lende, D., Goldgof, D.B.: Towards ai-driven human- machine co-teaming for adaptive and agile cyber security operation centers. arXiv preprint arXiv:2505.06394 (2025)

work page arXiv 2025

[14] [14]

Sensors22(4), 1–15 (2022)

Albasheer,H.,MdSiraj,M.,Mubarakali,A.,ElsierTayfour,O.,Salih,S.,Hamdan,M.,Khan, S.,Zainal,A.,Kamarudeen,S.:Cyber-attackpredictionbasedonnetworkintrusiondetection systems for alert correlation techniques: A survey. Sensors22(4), 1–15 (2022)

work page 2022

[15] [15]

Journal of Information Security and Applications p

Andrade, R.O., Yoo, S.G.: Cognitive security: A comprehensive study of cognitive science in cybersecurity. Journal of Information Security and Applications p. 102352 (2019)

work page 2019

[16] [16]

Computers & Security p

Bayer, M., Frey, T., Reuter, C.: Multi-level fine-tuning, data augmentation, and few-shot learning for specialized cyber threat intelligence. Computers & Security p. 103430 (2023)

work page 2023

[17] [17]

CIN: Computers, Informatics, Nursing38(10), 500–507 (2020)

Benton,S.E.,Hueckel,R.M.,Taicher,B.,Muckler,V.C.:Usabilityassessmentofanelectronic handoff tool to facilitate and improve postoperative communication between anesthesia and intensive care unit staff. CIN: Computers, Informatics, Nursing38(10), 500–507 (2020)

work page 2020

[18] [18]

Computers & Security (2024)

Boffa, M., Drago, I., Mellia, M., Vassio, L., Giordano, D., Valentim, R., Houidi, Z.B.: Logprécis: Unleashing language models for automated malicious log analysis: Précis: A concise summary of essential points, statements, or facts. Computers & Security (2024)

work page 2024

[19] [19]

In: Proc

Chen,Y.,Arunasalam,A.,Celik,Z.B.:Canlargelanguagemodelsprovidesecurity&privacy advice? measuring the ability of llms to refute misconceptions. In: Proc. of ACSAC 2023 (2023)

work page 2023

[20] [20]

of EMNLP 2023

Cheng, L., Li, X., Bing, L.: Is GPT-4 a good data analyst? In: Proc. of EMNLP 2023. pp. 9496–9514 (2023) LLMs, You Can Evaluate It! 39

work page 2023

[21] [21]

In: Proc

Chiang, C.H., Lee, H.y.: A closer look into using large language models for automatic evaluation. In: Proc. of EMNLP 2023. pp. 8928–8942 (2023)

work page 2023

[22] [22]

In: Proc

Chu, S.Y., Kim, J.W., Yi, M.Y.: Think together and work better: Combining humans’ and llms’ think-aloud outcomes for effective text evaluation. In: Proc. of CHI 2025. pp. 1–23 (2025)

work page 2025

[23] [23]

Marg: Multi-agent review generation for scientific papers

D’Arcy, M., Hope, T., Birnbaum, L., Downey, D.: Marg: Multi-agent review generation for scientific papers. arXiv preprint arXiv:2401.04259 (2024)

work page arXiv 2024

[24] [24]

Du, J., Wang, Y., Zhao, W., Deng, Z., Liu, S., Lou, R., Zou, H.P., Venkit, P.N., Zhang, N., Srinath,M., etal.: Llmsassist nlpresearchers: Critiquepaper (meta-)reviewing. In:Proc. of EMNLP 2024. pp. 5081–5099 (2024)

work page 2024

[25] [25]

In: Proc

Fu, J., Ng, S.K., Jiang, Z., Liu, P.: Gptscore: Evaluate as you desire. In: Proc. of NAACL

work page

[26] [26]

6556–6576 (2024)

pp. 6556–6576 (2024)

work page 2024

[27] [27]

In: Proc

Gadyatskaya, O., Papuc, D.: Chatgpt knows your attacks: Synthesizing attack trees using llms. In: Proc. of DSAI 2023. pp. 245–260 (2023)

work page 2023

[28] [28]

Sensors (2021)

González-Granadillo, G., González-Zarzosa, S., Diaz, R.: Security information and event management (siem): Analysis, trends, and usage in critical infrastructures. Sensors (2021)

work page 2021

[29] [29]

IEEE Access pp

Gupta,M.,Akiri,C.,Aryal,K.,Parker,E.,Praharaj,L.:Fromchatgpttothreatgpt:Impactof generative AI in cybersecurity and privacy. IEEE Access pp. 80218–80245 (2023)

work page 2023

[30] [30]

In: Proc

Gupta,N.,Traore,I.,deQuinan,P.M.F.:Automatedeventprioritizationforsecurityoperation center using deep learning. In: Proc. of Big Data 2019. pp. 5864–5872 (2019)

work page 2019

[31] [31]

International Journal for Quality in Health Care pp

Hales, B., Terblanche, M., Fowler, R., Sibbald, W.: Development of medical checklists for improved quality of patient care. International Journal for Quality in Health Care pp. 22–30 (2008)

work page 2008

[32] [32]

In: Proc

Hao, Y., He, H., Ho, J.C.: LLMSYN: Generating synthetic electronic health records without patient-level data. In: Proc. of MLHC 2024 (2024)

work page 2024

[33] [33]

Digital Threats pp

Happa, J., Agrafiotis, I., Helmhout, M., Bashford-Rogers, T., Goldsmith, M., Creese, S.: Assessing a decision support tool for soc analysts. Digital Threats pp. 1–35 (2021)

work page 2021

[34] [34]

Hattie,J.,Timperley,H.:Thepoweroffeedback.Reviewofeducationalresearch77,81–112 (2007)

work page 2007

[35] [35]

ACM Transactions on Information Systems pp

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems pp. 1–55 (2025)

work page 2025

[36] [36]

102609 (2022)

Husák, M., Sadlek, L., Špaček, S., Laštovička, M., Javorník, M., Komárková, J.: Crusoe: A toolsetforcybersituationalawarenessanddecisionsupportinincidenthandling.Computers & Security p. 102609 (2022)

work page 2022

[37] [37]

i’mgettinginformationthaticanactonnow

Jawad,A.,Assal,H.,Jaskolka,J.:"i’mgettinginformationthaticanactonnow":Exploring the level of actionable information in tool-generated threat reports. In: Proc. of EuroUSEC

work page

[38] [38]

NIST SP pp

Johnson, C.: Guide to cyber threat information sharing. NIST SP pp. 800–150 (2016)

work page 2016

[39] [39]

Johnson,C.:Ahandbookofincidentandaccidentreporting.Fail.Safety-CriticalSyst(2003)

work page 2003

[40] [40]

In: Proc

Kersten, L., Beelen, K., Zambon, E., Snijders, C., Allodi, L.: A field study to uncover and a tool to support the alert investigation process of tier-1 analysts. In: Proc. of USEC 2025 (2025)

work page 2025

[41] [41]

In: Proc

Kersten,L.,Darré,S.,Mulders,T.,Zambon,E.,Caselli,M.,Snijders,C.,Allodi,L.:Asecurity alertinvestigationtoolsupportingtier1analystsincontextualizingandunderstandingnetwork security events. In: Proc. of ACSAC 2024. pp. 890–905. IEEE (2024)

work page 2024

[42] [42]

In: Proc

Kersten, L., Mulders, T., Zambon, E., Snijders, C., Allodi, L.: ’give me structure’: Synthesis and evaluation of a (network) threat analysis process supporting tier 1 investigations in a security operation center. In: Proc. of SOUPS 2023. pp. 97–111 (2023) 40 H. Okada et al

work page 2023

[43] [43]

In: Proc

Kim,S.,Shin,J.,Cho,Y.,Jang,J.,Longpre,S.,Lee,H.,Yun,S.,Shin,S.,Kim,S.,Thorne,J., et al.: Prometheus: Inducing fine-grained evaluation capability in language models. In: Proc. of ICLR 2023 (2023)

work page 2023

[44] [44]

Knerler,K.,Parker,I.,Zimmerman,C.:11strategiesofaworld-classcybersecurityoperations center. Tech. Rep. Second Edition, MITRE (2022)

work page 2022

[45] [45]

In: Proc

Kokulu,F.B.,Soneji,A.,Bao,T.,Shoshitaishvili,Y.,Zhao,Z.,Doupé,A.,Ahn,G.J.:Matched and mismatched socs: A qualitative study on security operations center issues. In: Proc. of CCS 2019. pp. 1955–1970 (2019)

work page 2019

[46] [46]

International Journal of Educational Methodology pp

Kusmaryono, I., Wijayanti, D., Maharani, H.R.: Number of response options, reliability, validity,andpotentialbiasintheuseofthelikertscaleeducationandsocialscienceresearch: A literature review. International Journal of Educational Methodology pp. 625–637 (2022)

work page 2022

[47] [47]

PLOS Digital Health4(9), e0001015 (2025)

Kwong, E., Cole, A., Byrd, E., Sippo, D., Yu, F., Adapa, K., Shea, C.M., Moore, C., Das, S., Mazur, L.: Design approaches for developing quality checklists in healthcare organizations: a scoping review. PLOS Digital Health4(9), e0001015 (2025)

work page 2025

[48] [48]

NEJM AI1(8), AIoa2400196 (2024)

Liang,W.,Zhang,Y.,Cao,H.,Wang,B.,Ding,D.Y.,Yang,X.,Vodrahalli,K.,He,S.,Smith, D.S., Yin, Y., et al.: Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI1(8), AIoa2400196 (2024)

work page 2024

[49] [49]

org/CorpusID:263830310

Liu, R., Shah, N.B.: Reviewergpt? an exploratory study on using large language models for paper reviewing. arXiv preprint arXiv:2306.00622 (2023)

work page arXiv 2023

[50] [50]

In: Proc

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-eval: NLG evaluation using gpt-4 with better human alignment. In: Proc. of EMNLP 2023. pp. 2511–2522 (2023)

work page 2023

[51] [51]

HD - Eval : Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition , February 2024

Liu,Y.,Yang,T.,Huang,S.,Zhang,Z.,Huang,H.,Wei,F.,Deng,W.,Sun,F.,Zhang,Q.:Hd- eval: Aligning large language model evaluators through hierarchical criteria decomposition. arXiv preprint arXiv:2402.15754 (2024)

work page arXiv 2024

[52] [52]

Computers (2025)

Loumachi, F.Y., Ghanem, M.C., Ferrag, M.A.: Advancing cyber incident timeline analysis through retrieval-augmented generation and large language models. Computers (2025)

work page 2025

[53] [53]

301683 (2024)

Michelet, G., Breitinger, F.: Chatgpt, llama, can you write my report? an experiment on assisteddigitalforensicsreportswrittenusing(local)largelanguagemodels.ForensicScience International: Digital Investigation p. 301683 (2024)

work page 2024

[54] [54]

arXiv preprint arXiv:2401.10036 (2024)

Mitra,S.,Neupane,S.,Chakraborty,T.,Mittal,S.,Piplai,A.,Gaur,M.,Rahimi,S.:LOCAL- INTEL:generatingorganizationalthreatintelligencefromglobalandlocalcyberknowledge. arXiv preprint arXiv:2401.10036 (2024)

work page arXiv 2024

[55] [55]

In: Proc

Murugadoss, B., Poelitz, C., Drosos, I., Le, V., McKenna, N., Negreanu, C.S., Parnin, C., Sarkar, A.: Evaluating the evaluator: Measuring llms’ adherence to task evaluation instruc- tions. In: Proc. of AAAI 2025. pp. 19589–19597 (2025)

work page 2025

[56] [56]

Proceedings of the ACM on Human-Computer Interaction pp

Nepal, S., Hernandez, J., Lewis, R., Chaudhry, A., Houck, B., Knudsen, E., Rojas, R., Tankus,B.,Prafullchandra,H.,Czerwinski,M.:Burnoutincybersecurityincidentresponders: Exploring the factors that light the fire. Proceedings of the ACM on Human-Computer Interaction pp. 1–35 (2024)

work page 2024

[57] [57]

In: Proc

Oba, T., Taniguchi, T., Yanai, N.: Score and you shall find: A novel regularization technique for cyberattack detection in industrial control systems. In: Proc. of RICSS 2024. pp. 47–56. ACM (2024)

work page 2024

[58] [58]

Perrina,F.,Marchiori,F.,Conti,M.,Verde,N.V.:AGIR:Automatingcyberthreatintelligence reportingwithnaturallanguagegeneration.In:Proc.ofBigData2023.pp.3053–3062(2023)

work page 2023

[59] [59]

In: Proc

Renners, L., Heine, F., Kleiner, C., Rodosek, G.D.: Adaptive and intelligible prioritization for network security incidents. In: Proc. of Cyber Security 2019. pp. 1–8 (2019)

work page 2019

[60] [60]

Journal of Critical Care72, 154153 (2022) LLMs, You Can Evaluate It! 41

Rose, L., Istanboulian, L., Amaral, A.C.K.B., Burry, L., Cox, C.E., Cuthbertson, B.H., Iwashyna, T.J., Dale, C.M., Fraser, I.: Co-designed and consensus based development of a quality improvement checklist of patient and family-centered actionable processes of care for adults with persistent critical illness. Journal of Critical Care72, 154153 (2022) LLMs...

work page 2022

[61] [61]

Computers & Operations Research pp

Ryan, J.J., Mazzuchi, T.A., Ryan, D.J., Lopez de la Cruz, J., Cooke, R.: Quantifying infor- mation security risks using expert judgment elicitation. Computers & Operations Research pp. 774–784 (2012)

work page 2012

[62] [62]

arXiv preprint arXiv:2501.18099 (2025)

Saha, S., Li, X., Ghazvininejad, M., Weston, J., Wang, T.: Learning to plan & reason for evaluation with thinking-llm-as-a-judge. arXiv preprint arXiv:2501.18099 (2025)

work page arXiv 2025

[63] [63]

Scarfone,K.A.,Grance,T.,Masone,K.:Sp800-61rev.1.computersecurityincidenthandling guide (2008)

work page 2008

[64] [64]

IEEE Access10, 27881–27898 (2022)

Shahjee,D.,Ware,N.:Integratednetworkandsecurityoperationcenter:Asystematicanalysis. IEEE Access10, 27881–27898 (2022)

work page 2022

[65] [65]

arXiv preprint arXiv:2407.15428 (2024)

Sharma, R., Okada, H., Oba, T., Subramanian, K., Yanai, N., Pranata, S.: Decoding bac- net packets: A large language model approach for packet interpretation. arXiv preprint arXiv:2407.15428 (2024)

work page arXiv 2024

[66] [66]

In: Proc

Shibli,A.M.,Pritom,M.M.A.,Gupta,M.:Abusegpt:Abuseofgenerativeaichatbotstocreate smishing campaigns. In: Proc. of ISDFS 2024. pp. 1–6 (2024)

work page 2024

[67] [67]

arXiv preprint arXiv:2506.09365 (2025)

Singh, R., Chhetri, M.B., Nepal, S., Paris, C.: Contextbuddy: Ai-enhanced contextual insights for security alert investigation (applied to intrusion detection). arXiv preprint arXiv:2506.09365 (2025)

work page arXiv 2025

[68] [68]

Kalamazoo, MI: Eval Cent16, 2008 (2000)

Stufflebeam, D.L.: Guidelines for developing evaluation checklists: the checklists develop- ment checklist (cdc). Kalamazoo, MI: Eval Cent16, 2008 (2000)

work page 2008

[69] [69]

Journal of Cybersecurity and Privacy pp

Tilbury, J., Flowerday, S.: Humans and automation: Augmenting security operation centers. Journal of Cybersecurity and Privacy pp. 388–409 (2024)

work page 2024

[70] [70]

Automated alert classification and triage (AACT): An intelligent system for the prioritisation of cybersecurity alerts,

Turcotte, M., Labrèche, F., Paquette, S.O.: Automated alert classification and triage (aact): An intelligent system for the prioritisation of cybersecurity alerts. arXiv preprint arXiv:2505.09843 (2025)

work page arXiv 2025

[71] [71]

Computers & Security p

van der Kleij, R., Schraagen, J.M., Cadet, B., Young, H.: Developing decision support for cybersecurity threat and incident managers. Computers & Security p. 102535 (2022)

work page 2022

[72] [72]

Wharton, C., Kintsch, W.: An overview of construction-integration model: a theory of com- prehensionasafoundationforanewcognitivearchitecture.ACMSigartBulletinpp.169–173 (1991)

work page 1991

[73] [73]

In: Proc

Wickramasekara, A., Scanlon, M.: A framework for integrated digital forensic investigation employing autogen ai agents. In: Proc. of ISDFS 2024. pp. 01–06 (2024)

work page 2024

[74] [74]

In: Proc

Yen,T.F.,Oprea,A.,Onarlioglu,K.,Leetham,T.,Robertson,W.,Juels,A.,Kirda,E.:Beehive: large-scale log analysis for detecting suspicious activity in enterprise networks. In: Proc. of ACSAC 2013. pp. 199–208 (2013)

work page 2013

[75] [75]

In: Proc

Zhong, C., Yen, J., Liu, P., Erbacher, R.F.: Automate cybersecurity data triage by leveraging human analysts’ cognitive process. In: Proc. of BigDataSecurity/HPSC/IDS 2016 (2016)

work page 2016

[76] [76]

IEEE Systems Journal pp

Zhong, C., Yen, J., Liu, P., Erbacher, R.F.: Learning from experts’ experience: Toward automated cyber security data triage. IEEE Systems Journal pp. 603–614 (2019)

work page 2019

[77] [77]

In: Proc

Zhou, R., Chen, L., Yu, K.: Is llm a reliable reviewer? a comprehensive evaluation of llm on automatic paper reviewing tasks. In: Proc. of LREC-COLING 2024. pp. 9340–9351 (2024)

work page 2024

[78] [78]

35045 (2025)

Zubaer, A.A., Granitzer, M., Geschwind, S., Graf Lambsdorff, J., Voss, D.: Gpt-4 shows comparableperformancetohumanexaminersinrankingopen-textanswers.ScientificReports p. 35045 (2025)

work page 2025