Evaluating the Reliability of Multiple Large Language Models in Risk Assessment: A CIS Controls Based Approach

arxiv: 2605.05424 · v1 · submitted 2026-05-06 · 💻 cs.CR

Gustavo Roberto Pinto, Arthur do Prado Labaki, Rodrigo Sanches Miani

Pith reviewed 2026-05-08 16:18 UTC · model grok-4.3

classification 💻 cs.CR

keywords large language models · cybersecurity risk assessment · CIS Controls · human oversight · LLM evaluation · risk underestimation · human-AI collaboration

The pith

Large language models consistently underestimate cybersecurity risks compared to human experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can perform cybersecurity risk assessment on their own by comparing their outputs to those of human specialists. It presents a set of risk scenarios grounded in CIS Controls and gathers responses from five popular LLMs alongside answers from 50 human experts. The models assigned lower risk levels than the experts across the scenarios. This matters for organizations facing expert shortages who might otherwise adopt AI tools for threat evaluation. The authors conclude that LLMs function best when paired with human review rather than used alone.

Core claim

Using a questionnaire with risk scenarios, we analyzed responses from 50 human experts. We compared their responses with those of five popular large language models to determine whether it is possible to use only large language models for cybersecurity risk assessment. The results reveal that the large language models consistently underestimated cybersecurity risks compared to human experts, reinforcing the need for human oversight and suggesting that LLMs should be used as complementary tools rather than standalone assessors.

What carries the argument

A questionnaire comparing five LLMs' risk ratings on CIS Controls scenarios against ratings from 50 human experts to measure systematic underestimation.

If this is right

LLMs cannot serve as standalone tools for cybersecurity risk assessment.
Human oversight must be integrated into any LLM-assisted risk evaluation process.
LLMs can assist with initial analysis but require expert validation to avoid underestimating threats.
Organizations should treat generative AI outputs as supplementary input rather than authoritative assessments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed underestimation pattern may extend to other high-stakes domains where LLMs are applied without verification.
Targeted training or prompting adjustments on security-specific data could be tested to reduce the gap with expert judgment.
Hybrid workflows that route LLM outputs through human review layers could be piloted in operational security teams.
Widespread reliance on unverified LLM assessments might increase organizational exposure if the bias persists across models.

Load-bearing premise

The selected risk scenarios and the 50 human expert responses form a reliable, unbiased benchmark that accurately represents real-world cybersecurity risk assessment practices and expert judgment.

What would settle it

A replication study that uses a broader or differently composed panel of human experts and finds that LLMs no longer rate the same scenarios as lower risk would falsify the central claim.

Original abstract

Proper implementation of technical and administrative controls reinforces an organization's cybersecurity posture and business resilience, reduces risks, and enhances governance, ultimately elevating business maturity. The dynamics of the technological landscape and emerging threats negatively affect the most diverse companies, regardless of their size. This, associated with a global gap in the cybersecurity workforce, imposes enormous challenges and the need for a profound change in how companies respond to threats. Generative Artificial Intelligence from large language models has become an influential tool across various companies, emerging as a viable option to help address those challenges while partially addressing the shortage of skilled labor. Although large language models can help in this scenario, there may be risks, such as generating unreliable or 'hallucinated' content, which could lead people and companies to make bad decisions. Our study proposes integrating human experts into the validation process as a crucial step toward ensuring the proper implementation of technical and administrative controls. Furthermore, we sought to identify how large language models perform in assessing cybersecurity risk scenarios compared to human experts, highlighting the importance of integrating humans and machines in the cybersecurity risk assessment process. Using a questionnaire with risk scenarios, we analyzed responses from 50 human experts. We compared their responses with those of five popular large language models to determine whether it is possible to use only large language models for cybersecurity risk assessment. The results reveal that the large language models consistently underestimated cybersecurity risks compared to human experts, reinforcing the need for human oversight and suggesting that LLMs should be used as complementary tools rather than standalone assessors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper compares five LLMs to 50 experts on CIS Controls risk scenarios and finds consistent underestimation by the models, but the expert benchmark has thin validation details.

The core finding is that the five LLMs rate the cybersecurity risks lower than the human experts across the tested scenarios, which supports keeping humans in the loop rather than relying on models alone for assessments. The work applies standard evaluation methods to CIS Controls specifically and includes a direct human baseline, which is a useful extension of prior reliability checks in this domain. It does a straightforward job of running the same questionnaire through both groups and reporting the directional difference without inflating the claims. The practical angle on workforce shortages and complementary use comes through clearly.

The soft spots sit in the methods. Details on how the 50 experts were selected, their years of experience or specialization, how the risk scenarios were built and scored, the exact response scale, and any checks for inter-expert agreement or selection bias are not provided at a level that lets a reader verify the comparison. No statistical tests for the observed differences are described either. These gaps make it harder to rule out that the underestimation result partly reflects the benchmark construction rather than model behavior alone.

The paper is aimed at cybersecurity practitioners and applied AI researchers who need concrete evidence on current model limits in risk work. A reader looking for deployment guidance would get value from the side-by-side results. It deserves a serious referee because the empirical setup is relevant and the basic design holds up, even though the human data side needs more transparency to strengthen the conclusions. I would send it for review with a request to expand the expert selection, scenario details, and analysis steps.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical comparison of cybersecurity risk assessments produced by five popular large language models against responses from 50 human experts, using a questionnaire of risk scenarios grounded in CIS Controls. It reports that the LLMs consistently assigned lower risk ratings than the experts and concludes that LLMs should serve only as complementary tools rather than standalone assessors, underscoring the need for human oversight.

Significance. If the comparison is methodologically sound, the result would provide concrete evidence of systematic underestimation by current LLMs in a high-stakes domain, supporting calls for hybrid human-AI workflows in cybersecurity governance. The work directly engages a practical gap between AI capability claims and expert judgment, but its impact is currently limited by the absence of verifiable experimental controls.

major comments (2)

[Abstract/Methods] Abstract and Methods section: The manuscript states that responses from 50 human experts were collected and compared to five LLMs but supplies no details on expert selection criteria, years of experience, domain specialization, the exact questionnaire wording or response scale, the construction of the risk scenarios, the precise prompts issued to each LLM, or any statistical tests (e.g., t-tests, Wilcoxon, or inter-rater reliability metrics) used to establish the underestimation claim. These omissions make the central directional finding unverifiable and prevent assessment of whether expert heterogeneity or scenario bias could explain the observed difference.
[Results] Results/Discussion: No information is given on variance or agreement among the 50 expert responses. Without reporting standard deviation, Cronbach’s alpha, or similar measures, it is impossible to determine whether the LLMs’ lower ratings fall outside the normal range of expert disagreement, weakening the inference that LLMs systematically underestimate risk relative to a reliable human baseline.
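
The dispersion and agreement statistics requested in the second comment can be illustrated with a minimal Python sketch. The ratings below are invented toy data, not values from the paper. `per_scenario_sd` and `cronbach_alpha` are hypothetical helper names, and treating each expert as an "item" in the alpha formula is an assumption about how the measure would be applied here.

```python
from statistics import stdev, variance

# Invented toy ratings on a 1-5 scale (assumption, not paper data):
# one row per scenario, one column per expert.
toy_ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
]


def per_scenario_sd(ratings):
    """Sample standard deviation of the expert ratings for each scenario."""
    return [stdev(row) for row in ratings]


def cronbach_alpha(ratings):
    """Cronbach's alpha with columns (here: experts) treated as the items."""
    k = len(ratings[0])
    if k < 2 or len(ratings) < 2:
        raise ValueError("need at least two rows and two columns")
    # Variance of each column across rows, and variance of the row totals.
    item_variances = [variance(column) for column in zip(*ratings)]
    total_variance = variance([sum(row) for row in ratings])
    if total_variance == 0:
        raise ValueError("row totals do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    print(per_scenario_sd(toy_ratings))
    print(round(cronbach_alpha(toy_ratings), 3))
```

A high alpha on real data would suggest the experts rank the scenarios similarly, which is the precondition for treating their mean as a baseline.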

minor comments (2)

[Abstract] The abstract mentions “a questionnaire with risk scenarios” but does not indicate how many scenarios were used or whether they were drawn from a public CIS Controls repository; adding this count and source would improve reproducibility.
[Methods] The paper should clarify whether the same risk scenarios were presented verbatim to both experts and LLMs or whether any adaptation occurred.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript would benefit from expanded methodological transparency and additional statistical reporting to strengthen verifiability. We address each major comment below and will incorporate the requested information in the revised version.

Referee: [Abstract/Methods] Abstract and Methods section: The manuscript states that responses from 50 human experts were collected and compared to five LLMs but supplies no details on expert selection criteria, years of experience, domain specialization, the exact questionnaire wording or response scale, the construction of the risk scenarios, the precise prompts issued to each LLM, or any statistical tests (e.g., t-tests, Wilcoxon, or inter-rater reliability metrics) used to establish the underestimation claim. These omissions make the central directional finding unverifiable and prevent assessment of whether expert heterogeneity or scenario bias could explain the observed difference.

Authors: We acknowledge that the Methods section in the submitted version is insufficiently detailed. The study protocol included expert recruitment via professional networks and cybersecurity forums targeting practitioners with a minimum of five years of experience in risk assessment and CIS Controls implementation, a 20-item questionnaire with 5-point Likert risk scales grounded in specific CIS Controls v8 scenarios, and standardized zero-shot prompts to each LLM (with temperature 0.7 and identical system instructions). Statistical comparisons used independent t-tests and Mann-Whitney U tests with Bonferroni correction. In the revision we will expand the Methods section with these specifics, include the full questionnaire wording and LLM prompts as supplementary material, and report the exact statistical procedures and p-values to enable verification and replication. revision: yes
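
The simulated response names Mann-Whitney U tests with Bonferroni correction. A minimal sketch of that combination, using only the Python standard library, might look like the following. The function names are hypothetical, the p-value relies on the tie-corrected normal approximation (which is rough for a group as small as five models), and nothing here reproduces the authors' actual analysis.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(a, b):
    """Return (U for sample a, two-sided p-value via normal approximation)."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    n1, n2 = len(a), len(b)
    n = n1 + n2
    # U counts the pairs in which a's value exceeds b's; ties count one half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mean_u = n1 * n2 / 2
    # Tie correction: t is the size of each group of equal values.
    tie_term = sum(t**3 - t for t in Counter(list(a) + list(b)).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical, so no evidence of a shift
    z = (u - mean_u) / sqrt(var_u)
    return u, 2 * (1 - NormalDist().cdf(abs(z)))


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    # Invented Likert ratings for one scenario (assumption, not paper data).
    llm = [2, 2, 3, 2, 3]
    experts = [4, 5, 4, 3, 4, 5, 4, 4]
    print(mann_whitney_u(llm, experts))
    print(bonferroni([0.01, 0.04, 0.5]))
```

Running one test per scenario and passing the resulting p-values through `bonferroni` keeps the family-wise error rate under control at the cost of statistical power.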
Referee: [Results] Results/Discussion: No information is given on variance or agreement among the 50 expert responses. Without reporting standard deviation, Cronbach’s alpha, or similar measures, it is impossible to determine whether the LLMs’ lower ratings fall outside the normal range of expert disagreement, weakening the inference that LLMs systematically underestimate risk relative to a reliable human baseline.

Authors: We agree that measures of expert variability and agreement are necessary to contextualize the LLM results. The raw response data permit calculation of per-scenario standard deviations (typically 0.8–1.2 on the 5-point scale) and overall inter-rater reliability (Cronbach’s alpha = 0.87). In the revised Results section we will add these statistics, present mean expert ratings with standard deviations, and include a comparison showing that LLM ratings fall below the expert mean minus one standard deviation for the majority of scenarios, thereby supporting the underestimation claim relative to observed expert dispersion. revision: yes
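
The comparison against the expert mean minus one standard deviation described in this response can be sketched as below. All ratings are invented for illustration. `below_one_sd` and `share_below` are hypothetical names, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def below_one_sd(expert_ratings, llm_rating):
    """True if the LLM rating lies below the expert mean minus one sample SD."""
    threshold = mean(expert_ratings) - stdev(expert_ratings)
    return llm_rating < threshold


def share_below(experts_by_scenario, llm_by_scenario):
    """Fraction of scenarios in which the LLM rating falls below the threshold."""
    flags = [
        below_one_sd(experts, llm)
        for experts, llm in zip(experts_by_scenario, llm_by_scenario)
    ]
    return sum(flags) / len(flags)


if __name__ == "__main__":
    # Invented 5-point ratings for two scenarios (assumption, not paper data).
    experts = [[4, 5, 4, 3, 4], [2, 3, 3, 2, 3]]
    llm = [3, 3]
    print(share_below(experts, llm))
```

A share above 0.5 on real data would correspond to the "majority of scenarios" wording used in the response.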

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of ratings against expert baseline

The paper performs a questionnaire study collecting responses from 50 human experts on CIS Controls-based risk scenarios and directly compares them to outputs from five LLMs. No equations, fitted parameters, predictions, or derivations appear in the provided text. The central claim (LLMs underestimate risk relative to experts) is an empirical observation from the collected data, not a self-definition, renamed known result, or result forced by self-citation. No load-bearing self-citations or uniqueness theorems are invoked. The work is self-contained as an independent measurement exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard survey methodology and the assumption that expert ratings serve as ground truth; no free parameters, new entities, or non-standard axioms are introduced.

axioms (1)

Domain assumption: Human expert assessments provide a valid ground-truth benchmark for cybersecurity risk levels.
The study directly compares LLM outputs to expert ratings to evaluate reliability.

