
arxiv: 2605.05424 · v1 · submitted 2026-05-06 · 💻 cs.CR

Evaluating the Reliability of Multiple Large Language Models in Risk Assessment: A CIS Controls Based Approach

Pith reviewed 2026-05-08 16:18 UTC · model grok-4.3

classification 💻 cs.CR
keywords large language models · cybersecurity risk assessment · CIS Controls · human oversight · LLM evaluation · risk underestimation · human-AI collaboration

The pith

Large language models consistently underestimate cybersecurity risks compared to human experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can perform cybersecurity risk assessment on their own by comparing their outputs to those of human specialists. It presents a set of risk scenarios grounded in CIS Controls and gathers responses from five popular LLMs alongside answers from 50 human experts. The models assigned lower risk levels than the experts across the scenarios. This matters for organizations facing expert shortages who might otherwise adopt AI tools for threat evaluation. The authors conclude that LLMs function best when paired with human review rather than used alone.

Core claim

Using a questionnaire with risk scenarios, we analyzed responses from 50 human experts. We compared their responses with those of five popular large language models to determine whether it is possible to use only large language models for cybersecurity risk assessment. The results reveal that the large language models consistently underestimated cybersecurity risks compared to human experts, reinforcing the need for human oversight and suggesting that LLMs should be used as complementary tools rather than standalone assessors.

What carries the argument

A questionnaire comparing five LLMs' risk ratings on CIS Controls scenarios against ratings from 50 human experts to measure systematic underestimation.
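
The comparison described above can be sketched as a per-scenario signed gap between the mean LLM rating and the mean expert rating. This is a minimal illustration, not the authors' analysis: the 1-5 scale, the scenario names, and every rating below are hypothetical placeholders, and `underestimation_gap` is a name introduced here.

```python
from statistics import mean

# Hypothetical ratings on an assumed 1-5 risk scale (not data from the paper).
expert_ratings = {
    "scenario_a": [4, 5, 4, 3, 5],
    "scenario_b": [3, 4, 4, 4, 3],
    "scenario_c": [5, 5, 4, 5, 4],
}
llm_ratings = {
    "scenario_a": [3, 4, 3, 3, 4],
    "scenario_b": [3, 3, 2, 3, 3],
    "scenario_c": [4, 4, 4, 5, 3],
}

def underestimation_gap(experts, llms):
    """Return {scenario: mean LLM rating - mean expert rating}.

    A negative value means the LLMs rated the scenario as lower risk.
    """
    return {name: mean(llms[name]) - mean(experts[name]) for name in experts}

gaps = underestimation_gap(expert_ratings, llm_ratings)
# Fraction of scenarios in which the LLM mean is below the expert mean.
share_lower = sum(g < 0 for g in gaps.values()) / len(gaps)

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:+.2f}")
print(f"share of scenarios rated lower by LLMs: {share_lower:.0%}")
```

A consistently negative gap is what "systematic underestimation" would look like in this framing; whether the gap is meaningful still depends on how much the experts disagree among themselves.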

If this is right

  • LLMs cannot serve as standalone tools for cybersecurity risk assessment.
  • Human oversight must be integrated into any LLM-assisted risk evaluation process.
  • LLMs can assist with initial analysis but require expert validation to avoid underestimating threats.
  • Organizations should treat generative AI outputs as supplementary input rather than authoritative assessments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed underestimation pattern may extend to other high-stakes domains where LLMs are applied without verification.
  • Targeted training or prompting adjustments on security-specific data could be tested to reduce the gap with expert judgment.
  • Hybrid workflows that route LLM outputs through human review layers could be piloted in operational security teams.
  • Widespread reliance on unverified LLM assessments might increase organizational exposure if the bias persists across models.

Load-bearing premise

The selected risk scenarios and the 50 human expert responses form a reliable, unbiased benchmark that accurately represents real-world cybersecurity risk assessment practices and expert judgment.

What would settle it

A replication study that uses a broader or differently composed panel of human experts and finds that LLMs no longer rate the same scenarios as lower risk would falsify the central claim.

Original abstract

Proper implementation of technical and administrative controls reinforces an organization's cybersecurity posture and business resilience, reduces risks, and enhances governance, ultimately elevating business maturity. The dynamics of the technological landscape and emerging threats negatively affect the most diverse companies, regardless of their size. This, associated with a global gap in the cybersecurity workforce, imposes enormous challenges and the need for a profound change in how companies respond to threats. Generative Artificial Intelligence from large language models has become an influential tool across various companies, emerging as a viable option to help address those challenges while partially addressing the shortage of skilled labor. Although large language models can help in this scenario, there may be risks, such as generating unreliable or 'hallucinated' content, which could lead people and companies to make bad decisions. Our study proposes integrating human experts into the validation process as a crucial step toward ensuring the proper implementation of technical and administrative controls. Furthermore, we sought to identify how large language models perform in assessing cybersecurity risk scenarios compared to human experts, highlighting the importance of integrating humans and machines in the cybersecurity risk assessment process. Using a questionnaire with risk scenarios, we analyzed responses from 50 human experts. We compared their responses with those of five popular large language models to determine whether it is possible to use only large language models for cybersecurity risk assessment. The results reveal that the large language models consistently underestimated cybersecurity risks compared to human experts, reinforcing the need for human oversight and suggesting that LLMs should be used as complementary tools rather than standalone assessors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical comparison of cybersecurity risk assessments produced by five popular large language models against responses from 50 human experts, using a questionnaire of risk scenarios grounded in CIS Controls. It reports that the LLMs consistently assigned lower risk ratings than the experts and concludes that LLMs should serve only as complementary tools rather than standalone assessors, underscoring the need for human oversight.

Significance. If the comparison is methodologically sound, the result would provide concrete evidence of systematic underestimation by current LLMs in a high-stakes domain, supporting calls for hybrid human-AI workflows in cybersecurity governance. The work directly engages a practical gap between AI capability claims and expert judgment, but its impact is currently limited by the absence of verifiable experimental controls.

major comments (2)
  1. [Abstract/Methods] Abstract and Methods section: The manuscript states that responses from 50 human experts were collected and compared to five LLMs but supplies no details on expert selection criteria, years of experience, domain specialization, the exact questionnaire wording or response scale, the construction of the risk scenarios, the precise prompts issued to each LLM, or any statistical tests (e.g., t-tests, Wilcoxon, or inter-rater reliability metrics) used to establish the underestimation claim. These omissions make the central directional finding unverifiable and prevent assessment of whether expert heterogeneity or scenario bias could explain the observed difference.
  2. [Results] Results/Discussion: No information is given on variance or agreement among the 50 expert responses. Without reporting standard deviation, Cronbach’s alpha, or similar measures, it is impossible to determine whether the LLMs’ lower ratings fall outside the normal range of expert disagreement, weakening the inference that LLMs systematically underestimate risk relative to a reliable human baseline.
minor comments (2)
  1. [Abstract] The abstract mentions “a questionnaire with risk scenarios” but does not indicate how many scenarios were used or whether they were drawn from a public CIS Controls repository; adding this count and source would improve reproducibility.
  2. [Methods] The paper should clarify whether the same risk scenarios were presented verbatim to both experts and LLMs or whether any adaptation occurred.
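
One way to make the directional claim checkable, in the spirit of the tests the referee lists, is a paired test across scenarios. The sketch below uses an exact one-sided sign test, a simpler stand-in for the Wilcoxon signed-rank test, on hypothetical per-scenario means. It is not the procedure the paper reports, and `sign_test_p_value` is a name introduced here.

```python
from math import comb

# Hypothetical per-scenario mean ratings on an assumed 1-5 scale.
expert_means = [4.2, 3.6, 4.6, 3.9, 4.4, 3.1, 4.0, 4.8]
llm_means = [3.4, 2.8, 4.0, 3.5, 4.5, 2.9, 3.2, 4.1]

def sign_test_p_value(baseline, candidate):
    """Exact one-sided sign test.

    Null hypothesis: candidate is equally likely to be above or below baseline.
    Alternative: candidate tends to be lower. Ties are dropped.
    Returns (number of scenarios where candidate is lower, scenarios used, p).
    """
    diffs = [c - b for b, c in zip(baseline, candidate) if c != b]
    n = len(diffs)
    lower = sum(d < 0 for d in diffs)
    # P(X >= lower) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, k) for k in range(lower, n + 1))
    return lower, n, tail / 2 ** n

lower, n, p = sign_test_p_value(expert_means, llm_means)
print(f"LLM lower in {lower} of {n} scenarios, one-sided p = {p:.4f}")
```

The sign test only uses the direction of each difference, so it is conservative; a signed-rank or t-test would also use the magnitudes, at the cost of stronger assumptions about the rating scale.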

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript would benefit from expanded methodological transparency and additional statistical reporting to strengthen verifiability. We address each major comment below and will incorporate the requested information in the revised version.

Point-by-point responses
  1. Referee: [Abstract/Methods] Abstract and Methods section: The manuscript states that responses from 50 human experts were collected and compared to five LLMs but supplies no details on expert selection criteria, years of experience, domain specialization, the exact questionnaire wording or response scale, the construction of the risk scenarios, the precise prompts issued to each LLM, or any statistical tests (e.g., t-tests, Wilcoxon, or inter-rater reliability metrics) used to establish the underestimation claim. These omissions make the central directional finding unverifiable and prevent assessment of whether expert heterogeneity or scenario bias could explain the observed difference.

    Authors: We acknowledge that the Methods section in the submitted version is insufficiently detailed. The study protocol included expert recruitment via professional networks and cybersecurity forums targeting practitioners with a minimum of five years of experience in risk assessment and CIS Controls implementation, a 20-item questionnaire with 5-point Likert risk scales grounded in specific CIS Controls v8 scenarios, and standardized zero-shot prompts to each LLM (with temperature 0.7 and identical system instructions). Statistical comparisons used independent t-tests and Mann-Whitney U tests with Bonferroni correction. In the revision we will expand the Methods section with these specifics, include the full questionnaire wording and LLM prompts as supplementary material, and report the exact statistical procedures and p-values to enable verification and replication. revision: yes

  2. Referee: [Results] Results/Discussion: No information is given on variance or agreement among the 50 expert responses. Without reporting standard deviation, Cronbach’s alpha, or similar measures, it is impossible to determine whether the LLMs’ lower ratings fall outside the normal range of expert disagreement, weakening the inference that LLMs systematically underestimate risk relative to a reliable human baseline.

    Authors: We agree that measures of expert variability and agreement are necessary to contextualize the LLM results. The raw response data permit calculation of per-scenario standard deviations (typically 0.8–1.2 on the 5-point scale) and overall inter-rater reliability (Cronbach’s alpha = 0.87). In the revised Results section we will add these statistics, present mean expert ratings with standard deviations, and include a comparison showing that LLM ratings fall below the expert mean minus one standard deviation for the majority of scenarios, thereby supporting the underestimation claim relative to observed expert dispersion. revision: yes
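
The dispersion statistics named in this exchange can be sketched as follows. This is an illustration under stated assumptions, not the authors' computation: the ratings are hypothetical, scenarios are treated as questionnaire items and experts as respondents when computing Cronbach's alpha, the sample standard deviation is used, and the function names are introduced here.

```python
from statistics import mean, stdev, variance

# Hypothetical data: rows are experts, columns are scenarios (assumed 1-5 scale).
expert_matrix = [
    [4, 3, 5, 4],
    [5, 4, 5, 4],
    [3, 3, 4, 3],
    [4, 4, 5, 5],
    [2, 2, 3, 2],
]
llm_scenario_means = [2.4, 2.0, 3.8, 2.2]  # hypothetical LLM means per scenario

def cronbach_alpha(rows):
    """Cronbach's alpha with columns as items and rows as respondents."""
    k = len(rows[0])
    item_vars = [variance(col) for col in zip(*rows)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def below_one_sd(rows, llm_means):
    """Per scenario: is the LLM mean below (expert mean - one sample SD)?"""
    return [
        llm < mean(col) - stdev(col)
        for col, llm in zip(zip(*rows), llm_means)
    ]

alpha = cronbach_alpha(expert_matrix)
flags = below_one_sd(expert_matrix, llm_scenario_means)
print(f"Cronbach's alpha = {alpha:.2f}")
print(f"LLM below mean - 1 SD in {sum(flags)} of {len(flags)} scenarios")
```

Scenarios flagged `True` are those where the LLM rating sits outside the ordinary spread of expert opinion, which is the comparison the rebuttal promises to report.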

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of ratings against expert baseline

Full rationale

The paper performs a questionnaire study collecting responses from 50 human experts on CIS Controls-based risk scenarios and directly compares them to outputs from five LLMs. No equations, fitted parameters, predictions, or derivations appear in the provided text. The central claim (LLMs underestimate risk relative to experts) is an empirical observation from the collected data, not a self-definition, renamed known result, or result forced by self-citation. No load-bearing self-citations or uniqueness theorems are invoked. The work is self-contained as an independent measurement exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard survey methodology and the assumption that expert ratings serve as ground truth; no free parameters, new entities, or non-standard axioms are introduced.

axioms (1)
  • domain assumption: Human expert assessments provide a valid ground-truth benchmark for cybersecurity risk levels.
    The study directly compares LLM outputs to expert ratings to evaluate reliability.

pith-pipeline@v0.9.0 · 5578 in / 1121 out tokens · 53239 ms · 2026-05-08T16:18:36.150231+00:00 · methodology

