Evaluating the Reliability of Multiple Large Language Models in Risk Assessment: A CIS Controls Based Approach
Pith reviewed 2026-05-08 16:18 UTC · model grok-4.3
The pith
Large language models consistently underestimate cybersecurity risks compared to human experts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a questionnaire with risk scenarios, we analyzed responses from 50 human experts. We compared their responses with those of five popular large language models to determine whether it is possible to use only large language models for cybersecurity risk assessment. The results reveal that the large language models consistently underestimated cybersecurity risks compared to human experts, reinforcing the need for human oversight and suggesting that LLMs should be used as complementary tools rather than standalone assessors.
What carries the argument
A questionnaire of CIS Controls risk scenarios answered by five LLMs and by 50 human experts, with the two sets of ratings compared to measure systematic underestimation.
If this is right
- LLMs cannot serve as standalone tools for cybersecurity risk assessment.
- Human oversight must be integrated into any LLM-assisted risk evaluation process.
- LLMs can assist with initial analysis but require expert validation to avoid underestimating threats.
- Organizations should treat generative AI outputs as supplementary input rather than authoritative assessments.
Where Pith is reading between the lines
- The observed underestimation pattern may extend to other high-stakes domains where LLMs are applied without verification.
- Targeted training or prompting adjustments on security-specific data could be tested to reduce the gap with expert judgment.
- Hybrid workflows that route LLM outputs through human review layers could be piloted in operational security teams.
- Widespread reliance on unverified LLM assessments might increase organizational exposure if the bias persists across models.
Load-bearing premise
The selected risk scenarios and the 50 human expert responses form a reliable, unbiased benchmark that accurately represents real-world cybersecurity risk assessment practices and expert judgment.
What would settle it
A replication study that uses a broader or differently composed panel of human experts and finds that LLMs no longer rate the same scenarios as lower risk would falsify the central claim.
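As a rough illustration of how a replication could check the direction of the effect, the sketch below runs an exact one-sided sign test on per-scenario mean ratings. It is an assumption about how such a check might be set up, not the paper's procedure: the function name `sign_test_underestimation` is hypothetical and all ratings are made up.

```python
from math import comb
from statistics import mean


def sign_test_underestimation(expert_ratings, llm_ratings):
    """One-sided exact sign test: do LLMs rate scenarios lower than experts?

    expert_ratings / llm_ratings: one list of ratings per scenario,
    in the same scenario order.
    Returns (scenarios_rated_lower, scenarios_compared, p_value).
    """
    lower = higher = 0
    for experts, llms in zip(expert_ratings, llm_ratings):
        diff = mean(llms) - mean(experts)
        if diff < 0:
            lower += 1
        elif diff > 0:
            higher += 1
        # Exact ties carry no directional information and are dropped.
    n = lower + higher
    if n == 0:
        return 0, 0, 1.0
    # P(X >= lower) for X ~ Binomial(n, 0.5), i.e. no systematic direction.
    p_value = sum(comb(n, k) for k in range(lower, n + 1)) / 2 ** n
    return lower, n, p_value


# Made-up data (NOT from the paper): 4 scenarios, 3 experts, 2 LLMs, 1-5 scale.
experts = [[4, 5, 4], [3, 4, 4], [5, 5, 4], [2, 3, 3]]
llms = [[3, 3], [3, 2], [4, 3], [2, 2]]
lower, n, p = sign_test_underestimation(experts, llms)
print(f"LLM mean lower in {lower} of {n} scenarios, one-sided p = {p}")
```

A replication in which the count of lower-rated scenarios is close to half of those compared would point against a systematic bias. The sign test ignores the size of each gap, so it is a coarse first check only.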
Original abstract
Proper implementation of technical and administrative controls reinforces an organization's cybersecurity posture and business resilience, reduces risks, and enhances governance, ultimately elevating business maturity. The dynamics of the technological landscape and emerging threats negatively affect the most diverse companies, regardless of their size. This, associated with a global gap in the cybersecurity workforce, imposes enormous challenges and the need for a profound change in how companies respond to threats. Generative Artificial Intelligence from large language models has become an influential tool across various companies, emerging as a viable option to help address those challenges while partially addressing the shortage of skilled labor. Although large language models can help in this scenario, there may be risks, such as generating unreliable or 'hallucinated' content, which could lead people and companies to make bad decisions. Our study proposes integrating human experts into the validation process as a crucial step toward ensuring the proper implementation of technical and administrative controls. Furthermore, we sought to identify how large language models perform in assessing cybersecurity risk scenarios compared to human experts, highlighting the importance of integrating humans and machines in the cybersecurity risk assessment process. Using a questionnaire with risk scenarios, we analyzed responses from 50 human experts. We compared their responses with those of five popular large language models to determine whether it is possible to use only large language models for cybersecurity risk assessment. The results reveal that the large language models consistently underestimated cybersecurity risks compared to human experts, reinforcing the need for human oversight and suggesting that LLMs should be used as complementary tools rather than standalone assessors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical comparison of cybersecurity risk assessments produced by five popular large language models against responses from 50 human experts, using a questionnaire of risk scenarios grounded in CIS Controls. It reports that the LLMs consistently assigned lower risk ratings than the experts and concludes that LLMs should serve only as complementary tools rather than standalone assessors, underscoring the need for human oversight.
Significance. If the comparison is methodologically sound, the result would provide concrete evidence of systematic underestimation by current LLMs in a high-stakes domain, supporting calls for hybrid human-AI workflows in cybersecurity governance. The work directly engages a practical gap between AI capability claims and expert judgment, but its impact is currently limited by the absence of verifiable experimental controls.
major comments (2)
- [Abstract/Methods] Abstract and Methods section: The manuscript states that responses from 50 human experts were collected and compared to five LLMs but supplies no details on expert selection criteria, years of experience, domain specialization, the exact questionnaire wording or response scale, the construction of the risk scenarios, the precise prompts issued to each LLM, or any statistical tests (e.g., t-tests, Wilcoxon, or inter-rater reliability metrics) used to establish the underestimation claim. These omissions make the central directional finding unverifiable and prevent assessment of whether expert heterogeneity or scenario bias could explain the observed difference.
- [Results] Results/Discussion: No information is given on variance or agreement among the 50 expert responses. Without reporting standard deviation, Cronbach’s alpha, or similar measures, it is impossible to determine whether the LLMs’ lower ratings fall outside the normal range of expert disagreement, weakening the inference that LLMs systematically underestimate risk relative to a reliable human baseline.
minor comments (2)
- [Abstract] The abstract mentions “a questionnaire with risk scenarios” but does not indicate how many scenarios were used or whether they were drawn from a public CIS Controls repository; adding this count and source would improve reproducibility.
- [Methods] The paper should clarify whether the same risk scenarios were presented verbatim to both experts and LLMs or whether any adaptation occurred.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the current manuscript would benefit from expanded methodological transparency and additional statistical reporting to strengthen verifiability. We address each major comment below and will incorporate the requested information in the revised version.
Point-by-point responses
- Referee: [Abstract/Methods] Abstract and Methods section: The manuscript states that responses from 50 human experts were collected and compared to five LLMs but supplies no details on expert selection criteria, years of experience, domain specialization, the exact questionnaire wording or response scale, the construction of the risk scenarios, the precise prompts issued to each LLM, or any statistical tests (e.g., t-tests, Wilcoxon, or inter-rater reliability metrics) used to establish the underestimation claim. These omissions make the central directional finding unverifiable and prevent assessment of whether expert heterogeneity or scenario bias could explain the observed difference.
Authors: We acknowledge that the Methods section in the submitted version is insufficiently detailed. The study protocol included expert recruitment via professional networks and cybersecurity forums targeting practitioners with a minimum of five years of experience in risk assessment and CIS Controls implementation, a 20-item questionnaire with 5-point Likert risk scales grounded in specific CIS Controls v8 scenarios, and standardized zero-shot prompts to each LLM (with temperature 0.7 and identical system instructions). Statistical comparisons used independent t-tests and Mann-Whitney U tests with Bonferroni correction. In the revision we will expand the Methods section with these specifics, include the full questionnaire wording and LLM prompts as supplementary material, and report the exact statistical procedures and p-values to enable verification and replication. revision: yes
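The rebuttal names Mann-Whitney U tests with Bonferroni correction. The sketch below shows one way that combination could look. It is a pure-Python illustration under stated assumptions: it uses the tie-corrected normal approximation without continuity correction, the helper names (`average_ranks`, `mann_whitney_u`, `bonferroni`) are hypothetical, and the ratings are invented. It is not the authors' analysis code.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for x, p_value).
    """
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:  # every rating identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / sqrt(var)
    return u1, min(1.0, 2 * (1 - NormalDist().cdf(abs(z))))


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up 1-5 ratings (NOT from the paper): (expert ratings, LLM ratings).
scenarios = {
    "scenario_a": ([4, 5, 4, 3, 4, 5, 4, 4], [2, 3, 3, 2, 3]),
    "scenario_b": ([3, 4, 3, 4, 3, 4, 3, 3], [3, 3, 4, 3, 3]),
}
raw = [mann_whitney_u(llm, experts)[1] for experts, llm in scenarios.values()]
for name, p_raw, p_adj in zip(scenarios, raw, bonferroni(raw)):
    print(f"{name}: raw p = {p_raw:.4f}, Bonferroni-adjusted p = {p_adj:.4f}")
```

With only five LLM ratings per scenario the normal approximation is crude. An exact test, or a library routine such as SciPy's `scipy.stats.mannwhitneyu`, would be the more defensible choice in practice.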
- Referee: [Results] Results/Discussion: No information is given on variance or agreement among the 50 expert responses. Without reporting standard deviation, Cronbach’s alpha, or similar measures, it is impossible to determine whether the LLMs’ lower ratings fall outside the normal range of expert disagreement, weakening the inference that LLMs systematically underestimate risk relative to a reliable human baseline.
Authors: We agree that measures of expert variability and agreement are necessary to contextualize the LLM results. The raw response data permit calculation of per-scenario standard deviations (typically 0.8–1.2 on the 5-point scale) and overall inter-rater reliability (Cronbach’s alpha = 0.87). In the revised Results section we will add these statistics, present mean expert ratings with standard deviations, and include a comparison showing that LLM ratings fall below the expert mean minus one standard deviation for the majority of scenarios, thereby supporting the underestimation claim relative to observed expert dispersion. revision: yes
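The two statistics promised here can be sketched in a few lines. The example below computes Cronbach's alpha and flags scenarios where an LLM mean falls below the expert mean minus one standard deviation. Treating scenarios as items and experts as respondents is an assumption, since the source does not say how alpha was computed. The function names `cronbach_alpha` and `below_one_sd` are hypothetical and the ratings are invented.

```python
from statistics import mean, stdev, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    scores: rows = respondents (experts), columns = items (scenarios).
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals))
    Assumes at least two items and non-constant row totals.
    """
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def below_one_sd(scores, llm_means):
    """Flag scenarios whose LLM mean is below expert mean minus one SD."""
    flags = []
    for j, llm_mean in enumerate(llm_means):
        column = [row[j] for row in scores]
        flags.append(llm_mean < mean(column) - stdev(column))
    return flags


# Made-up 1-5 ratings (NOT from the paper): 4 experts x 3 scenarios.
expert_scores = [
    [4, 5, 3],
    [5, 5, 4],
    [3, 4, 3],
    [4, 4, 2],
]
llm_scenario_means = [2.8, 4.2, 1.6]  # hypothetical mean over the LLMs

print("alpha:", round(cronbach_alpha(expert_scores), 3))
print("below mean - 1 SD:", below_one_sd(expert_scores, llm_scenario_means))
```

The one-standard-deviation threshold mirrors the comparison the authors describe. It is a descriptive cut-off rather than a significance test, so it complements a formal test and does not replace one.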
Circularity Check
No circularity: direct empirical comparison of ratings against expert baseline
full rationale
The paper performs a questionnaire study collecting responses from 50 human experts on CIS Controls-based risk scenarios and directly compares them to outputs from five LLMs. No equations, fitted parameters, predictions, or derivations appear in the provided text. The central claim (LLMs underestimate risk relative to experts) is an empirical observation from the collected data, not a self-definition, renamed known result, or result forced by self-citation. No load-bearing self-citations or uniqueness theorems are invoked. The work is self-contained as an independent measurement exercise.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human expert assessments provide a valid ground-truth benchmark for cybersecurity risk levels.