QuiLL: An LLM-Based Vulnerability Assessment Framework for the Wild
Pith reviewed 2026-05-18 10:53 UTC · model grok-4.3
The pith
QuiLL provides the first end-to-end framework to evaluate large language models on detecting vulnerabilities in actual production code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
QuiLL is the first comprehensive evaluation framework for real-world vulnerability detection. It consists of an end-to-end pipeline that draws together diverse prompt designs for vulnerability detection and reasoning, a real-world vector data store constructed from the National Vulnerability Database to provide dynamic in-context learning, and a novel scoring metric which quantifies accuracy and reasoning quality of model predictions. This setup enables researchers to benchmark and compare the vulnerability detection capabilities of various LLMs and assess their readiness for deployment in actual code production pipelines.
What carries the argument
The QuiLL pipeline, which integrates diverse prompt designs, a vector store of real vulnerabilities from the National Vulnerability Database for dynamic retrieval, and a custom metric that scores both detection accuracy and reasoning quality.
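To make the retrieval step concrete, here is a minimal sketch of dynamic in-context example selection. It rests on several assumptions: a toy token-count embedding and an in-memory list stand in for a real embedding model and vector database, the three store records are invented placeholders rather than actual NVD entries, and `embed`, `retrieve`, and `build_prompt` are hypothetical names, not QuiLL's API.

```python
import math
import re
from collections import Counter

# Hypothetical store entries; a real store would hold NVD-derived records.
STORE = [
    {"cwe": "CWE-89", "code": 'query = "SELECT * FROM users WHERE id = " + user_id'},
    {"cwe": "CWE-787", "code": "char buf[8]; strcpy(buf, input);"},
    {"cwe": "CWE-79", "code": "element.innerHTML = request.params.comment"},
]

def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(snippet, k=1):
    """Return the k stored records most similar to the snippet under test."""
    query = embed(snippet)
    ranked = sorted(STORE, key=lambda r: cosine(query, embed(r["code"])), reverse=True)
    return ranked[:k]

def build_prompt(snippet, k=1):
    """Prepend the retrieved records as few-shot examples to the detection question."""
    shots = "\n".join(f"Example ({r['cwe']}):\n{r['code']}" for r in retrieve(snippet, k))
    return f"{shots}\n\nIs the following code vulnerable? Explain your reasoning.\n{snippet}"
```

Calling `build_prompt("strcpy(buf, argv[1]);")` would place the stored buffer-overflow record ahead of the question, which is the sense in which the in-context examples are "dynamic": they change with the code being assessed.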
If this is right
- Researchers can systematically benchmark multiple LLMs on their ability to detect vulnerabilities in real-world settings.
- The framework allows direct assessment of which models are ready for use inside actual code production pipelines.
- Optimization techniques can be tested specifically against the complexities of live vulnerability detection rather than synthetic cases.
- Dynamic in-context learning from the NVD store supplies examples matched to the code under review, which should make model outputs more pertinent during evaluation.
Where Pith is reading between the lines
- The same pipeline structure could be adapted to evaluate LLMs on related security tasks such as automated patch generation or exploit prediction.
- Adding more recent or specialized vulnerability data to the vector store might improve detection of emerging threats not well covered in the original NVD collection.
- Wider adoption of the accuracy-plus-reasoning scoring approach could raise standards for how LLM security evaluations measure explanation quality in other domains.
Load-bearing premise
That combining prompt engineering, retrieval from an NVD-derived vector store, and the proposed scoring metric will produce reliable and generalizable assessments of LLM performance on actual production code rather than only on curated examples.
What would settle it
A controlled experiment in which LLMs that score highly under QuiLL are applied to a fresh set of real production codebases. The claim would fail if their detection rates were no better than those of baseline models run without the framework's components.
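One way to read the outcome of such an experiment is a two-proportion z-test on detection rates. The sketch below is an assumption about how the comparison could be analyzed, not a procedure taken from the paper, and the counts in the usage lines are made-up placeholders.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Return (z, two-sided p-value) for H0: both detection rates are equal."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups detected everything or nothing: no measurable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Placeholder counts: vulnerabilities found out of those present in fresh codebases.
z, p = two_proportion_z(hits_a=62, n_a=100, hits_b=58, n_b=100)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A large p-value here would mean the high-scoring models are not distinguishable from the baseline on the fresh code, which is the outcome that would undercut the framework's readiness claim.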
Original abstract
Large Language Models (LLMs) have demonstrated exceptional progress in multiple domains of software engineering including software vulnerability detection. Using LLMs to automate vulnerability detection in the wild is an important and relatively under-explored problem. In this paper we propose QuiLL, the first comprehensive evaluation framework for real-world vulnerability detection. Our solution consists of an end-to-end pipeline that draws together cutting-edge LLM optimization techniques and strategies specifically catering to the complexities of real-world vulnerability detection. Our specific contributions include (i) diverse prompt designs for vulnerability detection and reasoning, (ii) a real-world vector data store constructed from the National Vulnerability Database to provide dynamic in-context learning, and (iii) a novel scoring metric which quantifies accuracy and reasoning quality of model predictions. QuiLL enables researchers to easily and systematically benchmark and compare the vulnerability detection capabilities of various LLMs and assess their readiness for deployment in actual code production pipelines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents QuiLL as the first comprehensive evaluation framework for real-world vulnerability detection using LLMs. It features an end-to-end pipeline incorporating diverse prompt designs for detection and reasoning, a vector data store built from the National Vulnerability Database (NVD) enabling dynamic in-context learning, and a novel scoring metric to quantify both the accuracy and reasoning quality of model outputs. The framework is intended to allow systematic benchmarking of LLMs to assess their suitability for integration into actual software production pipelines.
Significance. Should the proposed pipeline and metric prove effective through rigorous testing on diverse, real-world codebases, QuiLL could fill an important gap by providing a standardized, reproducible method for evaluating LLM-based vulnerability detection tools. The use of NVD-derived data for in-context examples is a strength that could improve generalization over synthetic datasets. However, the absence of such validation in the current manuscript means the significance is potential rather than demonstrated.
major comments (2)
- The abstract claims that QuiLL enables assessment of readiness for deployment in actual code production pipelines, yet no quantitative results, ablation studies, or details on the scoring metric's validation are provided, making it impossible to evaluate whether the central claims are supported by evidence.
- The integration of the NVD vector store for dynamic ICL and the novel scoring metric are described at a high level, but the manuscript does not report experiments applying the full pipeline to production code or undisclosed vulnerabilities, relying instead on NVD-sourced or curated snippets. This leaves the generalizability to noisy, real-world settings untested and is a load-bearing gap for the real-world readiness claim.
minor comments (2)
- Clarify the exact formula for the novel scoring metric, including how accuracy and reasoning quality are combined, to aid reproducibility.
- Include more comparisons to existing LLM vulnerability detection benchmarks to better position the novelty of the scoring metric.
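To illustrate what the first minor comment is asking for, the sketch below shows one hypothetical form such a metric might take: a convex combination of a correctness indicator and a reasoning-quality score in [0, 1]. The manuscript's actual formula is not given here, so the `Prediction` fields, the weight `alpha`, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    predicted_vulnerable: bool
    actually_vulnerable: bool
    reasoning_quality: float  # assumed to lie in [0, 1], e.g. from a rubric or judge

def combined_score(pred, alpha=0.5):
    """Hypothetical score: alpha * correctness + (1 - alpha) * reasoning quality."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if not 0.0 <= pred.reasoning_quality <= 1.0:
        raise ValueError("reasoning_quality must be in [0, 1]")
    correct = 1.0 if pred.predicted_vulnerable == pred.actually_vulnerable else 0.0
    return alpha * correct + (1.0 - alpha) * pred.reasoning_quality

def mean_score(preds, alpha=0.5):
    """Average the hypothetical per-sample score over a non-empty benchmark run."""
    return sum(combined_score(p, alpha) for p in preds) / len(preds)
```

Stating the weighting explicitly, whatever form the authors actually use, is what would let other groups reproduce the reported scores and test how sensitive model rankings are to the accuracy versus reasoning trade-off.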
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and have revised the manuscript to provide additional evidence and clarifications without overstating the current results.
- Referee: The abstract claims that QuiLL enables assessment of readiness for deployment in actual code production pipelines, yet no quantitative results, ablation studies, or details on the scoring metric's validation are provided, making it impossible to evaluate whether the central claims are supported by evidence.
  Authors: We agree that the original manuscript would benefit from more explicit supporting evidence for the deployment-readiness claim. In the revised version we have added a new experimental section reporting quantitative benchmarking results across multiple LLMs, ablation studies isolating the contributions of prompt variants and the NVD vector store, and a dedicated validation subsection for the accuracy-plus-reasoning metric that includes comparison against expert human judgments. These additions directly address the concern and allow readers to assess the strength of the central claims.
  Revision: yes
- Referee: The integration of the NVD vector store for dynamic ICL and the novel scoring metric are described at a high level, but the manuscript does not report experiments applying the full pipeline to production code or undisclosed vulnerabilities, relying instead on NVD-sourced or curated snippets. This leaves the generalizability to noisy, real-world settings untested and is a load-bearing gap for the real-world readiness claim.
  Authors: We acknowledge that NVD-derived snippets, while drawn from real disclosed vulnerabilities, do not fully replicate the noise and scale of proprietary production codebases. We have therefore expanded the manuscript with additional experiments that inject realistic noise (e.g., incomplete functions, mixed-language files) into the test set and have clarified the scope of the current evaluation. We maintain that the NVD vector store provides a reproducible, real-world foundation for dynamic ICL, but we have tempered the language around immediate deployment readiness to reflect the proof-of-concept nature of the reported results.
  Revision: partial
- Direct evaluation on undisclosed vulnerabilities is inherently impossible because such vulnerabilities are not publicly available for testing.
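The "incomplete functions" perturbation mentioned in the authors' second response can be pictured with a small sketch. The truncation rule, the `truncate_function` name, and the seed handling are assumptions for illustration; the rebuttal does not describe how the noise is actually injected.

```python
import random

def truncate_function(source, keep_min=0.3, keep_max=0.8, seed=None):
    """Drop trailing lines so only a random leading fraction of the snippet remains."""
    rng = random.Random(seed)
    lines = source.splitlines()
    if len(lines) <= 1:
        return source  # nothing to cut
    fraction = rng.uniform(keep_min, keep_max)
    # Keep at least one line and always remove at least one.
    keep = max(1, min(len(lines) - 1, round(len(lines) * fraction)))
    return "\n".join(lines[:keep])

snippet = """int copy(char *dst, const char *src) {
    size_t n = strlen(src);
    memcpy(dst, src, n + 1);
    return 0;
}"""
print(truncate_function(snippet, seed=0))
```

Fixing the seed keeps the perturbed test set reproducible, so a drop in detection rate can be attributed to the missing context rather than to a different random cut on each run.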
Circularity Check
No circularity: framework assembles external components without self-referential reduction
The manuscript presents QuiLL as an assembled pipeline of standard LLM techniques (diverse prompts, NVD-derived vector store for ICL, and a proposed scoring metric) applied to public vulnerability data. No derivation step reduces a claimed result or prediction to a fitted parameter or self-citation chain within the paper; the central contribution is the integration itself rather than a closed mathematical or empirical loop that presupposes its outputs. All supporting elements draw from external, independently verifiable sources.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can perform meaningful vulnerability detection and reasoning when given appropriate prompts and in-context examples from the National Vulnerability Database.
invented entities (1)
- QuiLL scoring metric: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "Our specific contributions include (i) diverse prompt designs... (ii) a real-world vector data store... (iii) a novel scoring metric which quantifies accuracy and reasoning quality"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_induction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We evaluated five leading LLMs... under both ZS and FS settings"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.