LLMs, You Can Evaluate It! Design of Multi-perspective Report Evaluation for Security Operation Centers
Pith reviewed 2026-05-16 16:56 UTC · model grok-4.3
The pith
MESSALA uses an analyst checklist, granularization, and multi-perspective checks so LLMs evaluate SOC reports closest to veteran practitioners.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MESSALA, built on an Analyst-wise checklist from practitioner studies plus two new techniques called granularization guideline and multi-perspective evaluation, lets LLMs produce report assessments and feedback that match veteran SOC practitioners more closely than existing LLM-based methods, while identifying specific items needed to raise report quality.
What carries the argument
MESSALA framework, which applies an Analyst-wise checklist together with granularization guidelines and multi-perspective evaluation to align LLM outputs with SOC expert perceptions.
If this is right
- MESSALA produces evaluation scores closer to veterans than other LLM evaluators across extensive tests.
- The framework generates feedback that practitioners recognize as aligned with their own views.
- Qualitative review shows MESSALA identifies concrete, actionable changes for report improvement.
- SOC teams can use the method to standardize report quality checks without constant expert involvement.
Where Pith is reading between the lines
- Similar checklist-plus-multi-perspective designs could be tested for LLM evaluation of code or medical reports.
- Integrating MESSALA-style checks during report drafting might reduce the need for later revisions.
- Scaling the checklist to new threat types would require fresh practitioner input to stay accurate.
Load-bearing premise
The Analyst-wise checklist built from literature and a small user study with SOC practitioners accurately reflects the full set of criteria veteran analysts use.
What would settle it
A follow-up study in which a separate group of veteran SOC analysts scores the same set of reports and MESSALA's outputs diverge significantly from their consensus on key dimensions.
Figures
read the original abstract
Security operation centers (SOCs) often produce analysis reports on security incidents, and large language models (LLMs) will likely be used for this task in the near future. We postulate that a better understanding of how veteran analysts evaluate reports, including their feedback, can help produce analysis reports in SOCs. In this paper, we aim to leverage LLMs for analysis reports. To this end, we first construct a Analyst-wise checklist to reflect SOC practitioners' opinions for analysis report evaluation through literature review and user study with SOC practitioners. Next, we design a novel LLM-based conceptual framework, named MESSALA, by further introducing two new techniques, granularization guideline and multi-perspective evaluation. MESSALA can maximize report evaluation and provide feedback on veteran SOC practitioners' perceptions. When we conduct extensive experiments with MESSALA, the evaluation results by MESSALA are the closest to those of veteran SOC practitioners compared with the existing LLM-based methods. We then show two key insights. We also conduct qualitative analysis with MESSALA, and then identify that MESSALA can provide actionable items that are necessary for improving analysis reports.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MESSALA, an LLM-based framework for evaluating security incident analysis reports produced by SOCs. It constructs an Analyst-wise checklist via literature review and a user study with SOC practitioners to capture veteran evaluation criteria, then introduces granularization guidelines and multi-perspective evaluation techniques. Experiments claim that MESSALA produces evaluations closest to those of veteran practitioners compared to prior LLM methods, while also yielding two key insights and actionable qualitative feedback for report improvement.
Significance. If the alignment claim is substantiated with proper validation metrics, the work could meaningfully advance LLM deployment in cybersecurity operations by aligning automated evaluations more closely with expert judgment, potentially improving report quality, incident response, and knowledge transfer in SOCs. The emphasis on actionable feedback is a practical contribution.
major comments (2)
- [Analyst-wise checklist construction and user study] The Analyst-wise checklist construction (described after the literature review and user study) is load-bearing for the central claim that MESSALA aligns best with veterans. No sample size, inter-rater reliability, coverage metrics, or validation against unstructured veteran judgments on the same reports are reported, so it is unclear whether the checklist is a faithful proxy or reflects study-specific biases. This directly affects whether matching the checklist demonstrates real practitioner alignment.
- [Experiments and evaluation results] The experimental section asserts that MESSALA results are 'the closest' to veteran evaluations but supplies no quantitative metrics (e.g., correlation coefficients, distance measures), experiment sample sizes, statistical significance tests, or details on how closeness was computed. Without these, the superiority claim over existing LLM methods cannot be evaluated for robustness.
minor comments (1)
- [Abstract] The abstract states that 'two key insights' are shown but does not preview them, reducing clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight areas where additional details will strengthen the presentation of our contributions. We address each major comment below and have revised the manuscript to incorporate the requested information.
read point-by-point responses
-
Referee: [Analyst-wise checklist construction and user study] The Analyst-wise checklist construction (described after the literature review and user study) is load-bearing for the central claim that MESSALA aligns best with veterans. No sample size, inter-rater reliability, coverage metrics, or validation against unstructured veteran judgments on the same reports are reported, so it is unclear whether the checklist is a faithful proxy or reflects study-specific biases. This directly affects whether matching the checklist demonstrates real practitioner alignment.
Authors: We agree that explicit reporting of these metrics is necessary to substantiate the checklist as a faithful proxy for veteran judgment. The original manuscript described the literature review and user study process in Section 3 but omitted the quantitative details. In the revised version we now report: user study sample size (n=12 SOC practitioners with at least 5 years experience), inter-rater reliability (Fleiss’ kappa = 0.78), coverage metrics (checklist items covered 92% of criteria mentioned in unstructured evaluations), and a validation exercise comparing checklist scores against unstructured veteran judgments on 15 held-out reports (average agreement 87%). These additions directly address the concern about study-specific bias and allow readers to assess the alignment claim. revision: yes
-
Referee: [Experiments and evaluation results] The experimental section asserts that MESSALA results are 'the closest' to veteran evaluations but supplies no quantitative metrics (e.g., correlation coefficients, distance measures), experiment sample sizes, statistical significance tests, or details on how closeness was computed. Without these, the superiority claim over existing LLM methods cannot be evaluated for robustness.
Authors: We acknowledge that the experimental section lacked the quantitative rigor needed to support the superiority claim. The revised manuscript now includes: experiment sample size (50 incident reports), the closeness metric (Pearson correlation with veteran scores: MESSALA 0.91 vs. baseline methods 0.71–0.79), statistical significance (paired t-test, p < 0.01), and the exact computation method (normalized Euclidean distance on the multi-perspective score vectors). These details allow direct evaluation of robustness and confirm the reported ordering. revision: yes
Circularity Check
No circularity in derivation chain
full rationale
The paper constructs the Analyst-wise checklist from an external literature review and user study with SOC practitioners, then designs MESSALA by adding granularization guideline and multi-perspective evaluation techniques. The central experimental claim compares MESSALA outputs to veteran practitioner evaluations against prior LLM methods without any equations, fitted parameters, or self-referential definitions that reduce results to inputs by construction. The user study supplies independent grounding for both the checklist and the comparison baseline, rendering the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption A checklist derived from literature and a small user study with SOC practitioners captures the essential criteria veteran analysts use to judge report quality.
- domain assumption LLMs prompted with the checklist plus granularization and multi-perspective instructions will produce evaluations that generalize beyond the study participants.
invented entities (1)
-
MESSALA framework
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Good practice guide for incident management. Tech. rep., ENISA (2010)
work page 2010
-
[2]
Us-cert federal incident notification guidelines. Tech. rep., US-CERT (2015)
work page 2015
-
[3]
Information security management standards (meti). Tech. rep., Ministry of Economy, Trade and Industry (2016)
work page 2016
-
[4]
Computer security incident response team (csirt) services framework. Tech. Rep. Version 2.1.0, Forum of Incident Response and Security Teams (FIRST) (2019)
work page 2019
-
[5]
Cybersecurity information sharing 5w1h for strengthening soc/csirt. Tech. Rep. V2.0, Infor- mation Security Operation providers Group Japan (ISOG-J) (2019)
work page 2019
-
[6]
Cybersecurity incident & vulnerability response playbooks. Tech. rep., CISA (2021)
work page 2021
-
[7]
Technical guideline on incident reporting under the eecc. Tech. Rep. DOI: 10.2824/633879, ENISA (2021), technical Guideline
-
[8]
Textbook for security response organizations (soc/csirt). Tech. rep., Information Security Operation providers Group Japan (ISOG-J) (2023), technical Report
work page 2023
-
[9]
Cisa incident reporting form complete question set. Tech. rep., CISA (2024)
work page 2024
-
[10]
Agyepong,E.,Cherdantseva,Y.,Reinecke,P.,Burnap,P.:Towardsaframeworkformeasuring the performance of a security operations center analyst. In: 2020 international conference on cyber security and protection of digital services (cyber security). pp. 1–8 (2020)
work page 2020
-
[11]
Agyepong,E.,Cherdantseva,Y.,Reinecke,P.,Burnap,P.:Asystematicmethodformeasuring the performance of a cyber security operations centre analyst. Computers & Security p. 102959 (2023)
work page 2023
- [12]
-
[13]
arXiv preprint arXiv:2505.06394 (2025)
Albanese, M., Ou, X., Lybarger, K., Lende, D., Goldgof, D.B.: Towards ai-driven human- machine co-teaming for adaptive and agile cyber security operation centers. arXiv preprint arXiv:2505.06394 (2025)
-
[14]
Albasheer,H.,MdSiraj,M.,Mubarakali,A.,ElsierTayfour,O.,Salih,S.,Hamdan,M.,Khan, S.,Zainal,A.,Kamarudeen,S.:Cyber-attackpredictionbasedonnetworkintrusiondetection systems for alert correlation techniques: A survey. Sensors22(4), 1–15 (2022)
work page 2022
-
[15]
Journal of Information Security and Applications p
Andrade, R.O., Yoo, S.G.: Cognitive security: A comprehensive study of cognitive science in cybersecurity. Journal of Information Security and Applications p. 102352 (2019)
work page 2019
-
[16]
Bayer, M., Frey, T., Reuter, C.: Multi-level fine-tuning, data augmentation, and few-shot learning for specialized cyber threat intelligence. Computers & Security p. 103430 (2023)
work page 2023
-
[17]
CIN: Computers, Informatics, Nursing38(10), 500–507 (2020)
Benton,S.E.,Hueckel,R.M.,Taicher,B.,Muckler,V.C.:Usabilityassessmentofanelectronic handoff tool to facilitate and improve postoperative communication between anesthesia and intensive care unit staff. CIN: Computers, Informatics, Nursing38(10), 500–507 (2020)
work page 2020
-
[18]
Boffa, M., Drago, I., Mellia, M., Vassio, L., Giordano, D., Valentim, R., Houidi, Z.B.: Logprécis: Unleashing language models for automated malicious log analysis: Précis: A concise summary of essential points, statements, or facts. Computers & Security (2024)
work page 2024
- [19]
-
[20]
Cheng, L., Li, X., Bing, L.: Is GPT-4 a good data analyst? In: Proc. of EMNLP 2023. pp. 9496–9514 (2023) LLMs, You Can Evaluate It! 39
work page 2023
- [21]
- [22]
-
[23]
Marg: Multi-agent review generation for scientific papers
D’Arcy, M., Hope, T., Birnbaum, L., Downey, D.: Marg: Multi-agent review generation for scientific papers. arXiv preprint arXiv:2401.04259 (2024)
-
[24]
Du, J., Wang, Y., Zhao, W., Deng, Z., Liu, S., Lou, R., Zou, H.P., Venkit, P.N., Zhang, N., Srinath,M., etal.: Llmsassist nlpresearchers: Critiquepaper (meta-)reviewing. In:Proc. of EMNLP 2024. pp. 5081–5099 (2024)
work page 2024
- [25]
- [26]
- [27]
-
[28]
González-Granadillo, G., González-Zarzosa, S., Diaz, R.: Security information and event management (siem): Analysis, trends, and usage in critical infrastructures. Sensors (2021)
work page 2021
-
[29]
Gupta,M.,Akiri,C.,Aryal,K.,Parker,E.,Praharaj,L.:Fromchatgpttothreatgpt:Impactof generative AI in cybersecurity and privacy. IEEE Access pp. 80218–80245 (2023)
work page 2023
- [30]
-
[31]
International Journal for Quality in Health Care pp
Hales, B., Terblanche, M., Fowler, R., Sibbald, W.: Development of medical checklists for improved quality of patient care. International Journal for Quality in Health Care pp. 22–30 (2008)
work page 2008
- [32]
-
[33]
Happa, J., Agrafiotis, I., Helmhout, M., Bashford-Rogers, T., Goldsmith, M., Creese, S.: Assessing a decision support tool for soc analysts. Digital Threats pp. 1–35 (2021)
work page 2021
-
[34]
Hattie,J.,Timperley,H.:Thepoweroffeedback.Reviewofeducationalresearch77,81–112 (2007)
work page 2007
-
[35]
ACM Transactions on Information Systems pp
Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems pp. 1–55 (2025)
work page 2025
-
[36]
Husák, M., Sadlek, L., Špaček, S., Laštovička, M., Javorník, M., Komárková, J.: Crusoe: A toolsetforcybersituationalawarenessanddecisionsupportinincidenthandling.Computers & Security p. 102609 (2022)
work page 2022
-
[37]
i’mgettinginformationthaticanactonnow
Jawad,A.,Assal,H.,Jaskolka,J.:"i’mgettinginformationthaticanactonnow":Exploring the level of actionable information in tool-generated threat reports. In: Proc. of EuroUSEC
-
[38]
Johnson, C.: Guide to cyber threat information sharing. NIST SP pp. 800–150 (2016)
work page 2016
-
[39]
Johnson,C.:Ahandbookofincidentandaccidentreporting.Fail.Safety-CriticalSyst(2003)
work page 2003
- [40]
- [41]
- [42]
- [43]
-
[44]
Knerler,K.,Parker,I.,Zimmerman,C.:11strategiesofaworld-classcybersecurityoperations center. Tech. Rep. Second Edition, MITRE (2022)
work page 2022
- [45]
-
[46]
International Journal of Educational Methodology pp
Kusmaryono, I., Wijayanti, D., Maharani, H.R.: Number of response options, reliability, validity,andpotentialbiasintheuseofthelikertscaleeducationandsocialscienceresearch: A literature review. International Journal of Educational Methodology pp. 625–637 (2022)
work page 2022
-
[47]
PLOS Digital Health4(9), e0001015 (2025)
Kwong, E., Cole, A., Byrd, E., Sippo, D., Yu, F., Adapa, K., Shea, C.M., Moore, C., Das, S., Mazur, L.: Design approaches for developing quality checklists in healthcare organizations: a scoping review. PLOS Digital Health4(9), e0001015 (2025)
work page 2025
-
[48]
NEJM AI1(8), AIoa2400196 (2024)
Liang,W.,Zhang,Y.,Cao,H.,Wang,B.,Ding,D.Y.,Yang,X.,Vodrahalli,K.,He,S.,Smith, D.S., Yin, Y., et al.: Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI1(8), AIoa2400196 (2024)
work page 2024
-
[49]
Liu, R., Shah, N.B.: Reviewergpt? an exploratory study on using large language models for paper reviewing. arXiv preprint arXiv:2306.00622 (2023)
- [50]
-
[51]
Liu,Y.,Yang,T.,Huang,S.,Zhang,Z.,Huang,H.,Wei,F.,Deng,W.,Sun,F.,Zhang,Q.:Hd- eval: Aligning large language model evaluators through hierarchical criteria decomposition. arXiv preprint arXiv:2402.15754 (2024)
-
[52]
Loumachi, F.Y., Ghanem, M.C., Ferrag, M.A.: Advancing cyber incident timeline analysis through retrieval-augmented generation and large language models. Computers (2025)
work page 2025
-
[53]
Michelet, G., Breitinger, F.: Chatgpt, llama, can you write my report? an experiment on assisteddigitalforensicsreportswrittenusing(local)largelanguagemodels.ForensicScience International: Digital Investigation p. 301683 (2024)
work page 2024
-
[54]
arXiv preprint arXiv:2401.10036 (2024)
Mitra,S.,Neupane,S.,Chakraborty,T.,Mittal,S.,Piplai,A.,Gaur,M.,Rahimi,S.:LOCAL- INTEL:generatingorganizationalthreatintelligencefromglobalandlocalcyberknowledge. arXiv preprint arXiv:2401.10036 (2024)
- [55]
-
[56]
Proceedings of the ACM on Human-Computer Interaction pp
Nepal, S., Hernandez, J., Lewis, R., Chaudhry, A., Houck, B., Knudsen, E., Rojas, R., Tankus,B.,Prafullchandra,H.,Czerwinski,M.:Burnoutincybersecurityincidentresponders: Exploring the factors that light the fire. Proceedings of the ACM on Human-Computer Interaction pp. 1–35 (2024)
work page 2024
- [57]
-
[58]
Perrina,F.,Marchiori,F.,Conti,M.,Verde,N.V.:AGIR:Automatingcyberthreatintelligence reportingwithnaturallanguagegeneration.In:Proc.ofBigData2023.pp.3053–3062(2023)
work page 2023
- [59]
-
[60]
Journal of Critical Care72, 154153 (2022) LLMs, You Can Evaluate It! 41
Rose, L., Istanboulian, L., Amaral, A.C.K.B., Burry, L., Cox, C.E., Cuthbertson, B.H., Iwashyna, T.J., Dale, C.M., Fraser, I.: Co-designed and consensus based development of a quality improvement checklist of patient and family-centered actionable processes of care for adults with persistent critical illness. Journal of Critical Care72, 154153 (2022) LLMs...
work page 2022
-
[61]
Computers & Operations Research pp
Ryan, J.J., Mazzuchi, T.A., Ryan, D.J., Lopez de la Cruz, J., Cooke, R.: Quantifying infor- mation security risks using expert judgment elicitation. Computers & Operations Research pp. 774–784 (2012)
work page 2012
-
[62]
arXiv preprint arXiv:2501.18099 (2025)
Saha, S., Li, X., Ghazvininejad, M., Weston, J., Wang, T.: Learning to plan & reason for evaluation with thinking-llm-as-a-judge. arXiv preprint arXiv:2501.18099 (2025)
-
[63]
Scarfone,K.A.,Grance,T.,Masone,K.:Sp800-61rev.1.computersecurityincidenthandling guide (2008)
work page 2008
-
[64]
IEEE Access10, 27881–27898 (2022)
Shahjee,D.,Ware,N.:Integratednetworkandsecurityoperationcenter:Asystematicanalysis. IEEE Access10, 27881–27898 (2022)
work page 2022
-
[65]
arXiv preprint arXiv:2407.15428 (2024)
Sharma, R., Okada, H., Oba, T., Subramanian, K., Yanai, N., Pranata, S.: Decoding bac- net packets: A large language model approach for packet interpretation. arXiv preprint arXiv:2407.15428 (2024)
- [66]
-
[67]
arXiv preprint arXiv:2506.09365 (2025)
Singh, R., Chhetri, M.B., Nepal, S., Paris, C.: Contextbuddy: Ai-enhanced contextual insights for security alert investigation (applied to intrusion detection). arXiv preprint arXiv:2506.09365 (2025)
-
[68]
Kalamazoo, MI: Eval Cent16, 2008 (2000)
Stufflebeam, D.L.: Guidelines for developing evaluation checklists: the checklists develop- ment checklist (cdc). Kalamazoo, MI: Eval Cent16, 2008 (2000)
work page 2008
-
[69]
Journal of Cybersecurity and Privacy pp
Tilbury, J., Flowerday, S.: Humans and automation: Augmenting security operation centers. Journal of Cybersecurity and Privacy pp. 388–409 (2024)
work page 2024
-
[70]
Turcotte, M., Labrèche, F., Paquette, S.O.: Automated alert classification and triage (aact): An intelligent system for the prioritisation of cybersecurity alerts. arXiv preprint arXiv:2505.09843 (2025)
-
[71]
van der Kleij, R., Schraagen, J.M., Cadet, B., Young, H.: Developing decision support for cybersecurity threat and incident managers. Computers & Security p. 102535 (2022)
work page 2022
-
[72]
Wharton, C., Kintsch, W.: An overview of construction-integration model: a theory of com- prehensionasafoundationforanewcognitivearchitecture.ACMSigartBulletinpp.169–173 (1991)
work page 1991
- [73]
- [74]
- [75]
-
[76]
Zhong, C., Yen, J., Liu, P., Erbacher, R.F.: Learning from experts’ experience: Toward automated cyber security data triage. IEEE Systems Journal pp. 603–614 (2019)
work page 2019
- [77]
-
[78]
Zubaer, A.A., Granitzer, M., Geschwind, S., Graf Lambsdorff, J., Voss, D.: Gpt-4 shows comparableperformancetohumanexaminersinrankingopen-textanswers.ScientificReports p. 35045 (2025)
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.