
arxiv: 2510.04056 · v2 · submitted 2025-10-05 · 💻 cs.CR

QuiLL: An LLM-Based Vulnerability Assessment Framework for the Wild

Pith reviewed 2026-05-18 10:53 UTC · model grok-4.3

classification 💻 cs.CR
keywords vulnerability detection · large language models · software security · prompt engineering · in-context learning · evaluation framework · National Vulnerability Database

The pith

QuiLL provides the first end-to-end framework to evaluate large language models on detecting vulnerabilities in actual production code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents QuiLL as a system for testing how well large language models spot software vulnerabilities outside of artificial test cases. It tackles the practical problem that most existing checks rely on curated or synthetic examples instead of real codebases. The approach combines varied prompt strategies for detection and explanation, retrieval of relevant past cases from a vector store built on the National Vulnerability Database, and a scoring method that measures both correctness and the quality of the model's reasoning. A reader would care because reliable automated checks could help identify security issues before code ships. If the framework works as described, it would give developers a concrete way to compare models and decide which ones are suitable for live security pipelines.

Core claim

QuiLL is the first comprehensive evaluation framework for real-world vulnerability detection. It consists of an end-to-end pipeline that draws together diverse prompt designs for vulnerability detection and reasoning, a real-world vector data store constructed from the National Vulnerability Database to provide dynamic in-context learning, and a novel scoring metric which quantifies accuracy and reasoning quality of model predictions. This setup enables researchers to benchmark and compare the vulnerability detection capabilities of various LLMs and assess their readiness for deployment in actual code production pipelines.

What carries the argument

The QuiLL pipeline, which integrates diverse prompt designs, a vector store of real vulnerabilities from the National Vulnerability Database for dynamic retrieval, and a custom metric that scores both detection accuracy and reasoning quality.
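
The retrieval step can be pictured with a minimal sketch. Everything here is an assumption for illustration: the store entries are hypothetical, and the bag-of-words `embed` function is a toy stand-in for whatever embedding model the paper actually uses. The sketch only shows the general idea of ranking stored vulnerability descriptions by cosine similarity to a query.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, store, k=2):
    """Return the k store entries most similar to the query text."""
    q = embed(query)
    ranked = sorted(store, key=lambda e: cosine(q, embed(e["text"])), reverse=True)
    return ranked[:k]

# Hypothetical entries; a real store would hold NVD-derived records.
STORE = [
    {"id": "example-1", "text": "buffer overflow in strcpy without bounds check"},
    {"id": "example-2", "text": "sql injection through unsanitized query string"},
    {"id": "example-3", "text": "use after free in connection cleanup"},
]

hits = top_k("strcpy copies user input into fixed buffer overflow", STORE, k=1)
print(hits[0]["id"])
```

In a real pipeline the retrieved entries would be passed to the prompt as in-context examples; the linear scan here would typically be replaced by an approximate nearest-neighbour index.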

If this is right

  • Researchers can systematically benchmark multiple LLMs on their ability to detect vulnerabilities in real-world settings.
  • The framework allows direct assessment of which models are ready for use inside actual code production pipelines.
  • Optimization techniques can be tested specifically against the complexities of live vulnerability detection rather than synthetic cases.
  • Dynamic in-context learning from the NVD store supplies retrieved examples that are intended to make model outputs more relevant during evaluation.
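
To make the zero-shot versus few-shot distinction concrete, here is a hedged sketch of how retrieved examples might be folded into a prompt. The function name `build_prompt`, the instruction wording, and the example fields (`code`, `verdict`, `reason`) are all hypothetical; the paper's actual prompt templates are not reproduced on this page.

```python
def build_prompt(code, examples=()):
    """Assemble a detection prompt; with no examples this is the zero-shot case."""
    parts = [
        "You are a security reviewer. Decide whether the code is vulnerable "
        "and explain why."
    ]
    # Each retrieved example becomes one in-context demonstration.
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}:\n{ex['code']}\n"
            f"Verdict: {ex['verdict']}\nReason: {ex['reason']}"
        )
    parts.append(f"Code under review:\n{code}\nVerdict:")
    return "\n\n".join(parts)

# Hypothetical demonstration, standing in for a retrieved NVD-derived case.
demo = {
    "code": "strcpy(buf, user_input);",
    "verdict": "vulnerable",
    "reason": "unbounded copy into a fixed-size buffer",
}

zero_shot = build_prompt("memcpy(dst, src, n);")
few_shot = build_prompt("memcpy(dst, src, n);", [demo])
```

The only difference between the two prompts is the block of demonstrations, which keeps a zero-shot versus few-shot comparison controlled.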

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline structure could be adapted to evaluate LLMs on related security tasks such as automated patch generation or exploit prediction.
  • Adding more recent or specialized vulnerability data to the vector store might improve detection of emerging threats not well covered in the original NVD collection.
  • Wider adoption of the accuracy-plus-reasoning scoring approach could raise standards for how LLM security evaluations measure explanation quality in other domains.

Load-bearing premise

That combining prompt engineering, retrieval from an NVD-derived vector store, and the proposed scoring metric will produce reliable and generalizable assessments of LLM performance on actual production code rather than only on curated examples.

What would settle it

A controlled experiment in which LLMs that receive high QuiLL scores are applied to a fresh set of real production codebases and show detection rates no better than baseline models without the framework's components.
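
One way such a comparison could be analysed is a two-proportion z-test on detection rates. This is an assumption about how the experiment might be evaluated, not a procedure taken from the paper, and the counts below are invented purely to exercise the function.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """z statistic for the difference between two detection rates."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    # A zero standard error means both groups are all hits or all misses.
    return (p_a - p_b) / se if se else 0.0

# Invented counts: a high-scoring model versus a baseline on 100 cases each.
z = two_proportion_z(60, 100, 50, 100)
print(round(z, 3))
```

A z value near zero would be consistent with the high-scoring model detecting no more vulnerabilities than the baseline, which is the outcome that would count against the framework.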

Figures

Figures reproduced from arXiv: 2510.04056 by Danyail Mateen, Rijha Safdar, Syed Taha Ali, Wajahat Hussain.

Figure 1: Illustrative example of LLM robustness limitations. In …
Figure 3: Related work in Software Vulnerability Detection
Figure 2: Prompt Template: The LLM is provided with a structured …
Figure 8: Model Performance on Vulnerability Detection and Reasoning. Each model has two grouped horizontal bars representing Zero-Shot (ZS) and Few-Shot (FS) settings. Bar segments: CP-CR (Correct Prediction & Correct Reason), CP-ICR (Correct Prediction & Incorrect Reason), ICP-ICR (Incorrect Prediction & Incorrect Reason). Models shown include Phi-4, LLaMA-3, and Qwen2.5-Coder …
Figure 9: Correct Predictions & Correct Reasoning by Prompt. Each line shows model performance across prompts for vulnerability detection in the wild. Overall, P-Decomp shows relatively consistent performance across all models. P-CoT and P-P&S both show better overall performance. Horizontal axis: Prompt (P-S, P-CoT, P-Decomp, P-P&S); vertical axis: Correct Predictions; models: Gemini, GPT-4, Qwen2.5-Coder, LLaMA-3, Phi-4.
Figure 10: Distribution of Scoring Metrics that jointly evaluates …
Original abstract

Large Language Models (LLMs) have demonstrated exceptional progress in multiple domains of software engineering including software vulnerability detection. Using LLMs to automate vulnerability detection in the wild is an important and relatively under-explored problem. In this paper we propose QuiLL, the first comprehensive evaluation framework for real-world vulnerability detection. Our solution consists of an end-to-end pipeline that draws together cutting-edge LLM optimization techniques and strategies specifically catering to the complexities of real-world vulnerability detection. Our specific contributions include (i) diverse prompt designs for vulnerability detection and reasoning (ii) a real-world vector data store constructed from the National Vulnerability Database to provide dynamic in-context learning, and (iii) a novel scoring metric which quantifies accuracy and reasoning quality of model predictions. QuiLL enables researchers to easily and systematically benchmark and compare the vulnerability detection capabilities of various LLMs and assess their readiness for deployment in actual code production pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents QuiLL as the first comprehensive evaluation framework for real-world vulnerability detection using LLMs. It features an end-to-end pipeline incorporating diverse prompt designs for detection and reasoning, a vector data store built from the National Vulnerability Database (NVD) enabling dynamic in-context learning, and a novel scoring metric to quantify both the accuracy and reasoning quality of model outputs. The framework is intended to allow systematic benchmarking of LLMs to assess their suitability for integration into actual software production pipelines.

Significance. Should the proposed pipeline and metric prove effective through rigorous testing on diverse, real-world codebases, QuiLL could fill an important gap by providing a standardized, reproducible method for evaluating LLM-based vulnerability detection tools. The use of NVD-derived data for in-context examples is a strength that could improve generalization over synthetic datasets. However, the absence of such validation in the current manuscript means the significance is potential rather than demonstrated.

major comments (2)
  1. The abstract claims that QuiLL enables assessment of readiness for deployment in actual code production pipelines, yet no quantitative results, ablation studies, or details on the scoring metric's validation are provided, making it impossible to evaluate whether the central claims are supported by evidence.
  2. The integration of the NVD vector store for dynamic ICL and the novel scoring metric are described at a high level, but the manuscript does not report experiments applying the full pipeline to production code or undisclosed vulnerabilities, relying instead on NVD-sourced or curated snippets. This leaves the generalizability to noisy, real-world settings untested and is a load-bearing gap for the real-world readiness claim.
minor comments (2)
  1. Clarify the exact formula for the novel scoring metric, including how accuracy and reasoning quality are combined, to aid reproducibility.
  2. Include more comparisons to existing LLM vulnerability detection benchmarks to better position the novelty of the scoring metric.
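
Since the exact formula is not given here, the following is only an illustrative sketch of how accuracy and reasoning quality could be combined. The three category labels come from the Figure 8 caption; the rule that a wrong prediction always lands in ICP-ICR and the weights in `WEIGHTS` are assumptions made for this example, not values from the paper.

```python
from collections import Counter

# Hypothetical weights: full credit only when verdict and reasoning are both right.
WEIGHTS = {"CP-CR": 1.0, "CP-ICR": 0.5, "ICP-ICR": 0.0}

def categorize(prediction_correct, reason_correct):
    """Map one model answer onto the categories named in the Figure 8 caption."""
    if not prediction_correct:
        # Assumption: reasoning attached to a wrong verdict is counted as incorrect.
        return "ICP-ICR"
    return "CP-CR" if reason_correct else "CP-ICR"

def score(outcomes):
    """Return per-category counts and a weighted mean score in [0, 1]."""
    counts = Counter(categorize(p, r) for p, r in outcomes)
    total = sum(counts.values())
    weighted = sum(WEIGHTS[c] * n for c, n in counts.items())
    return counts, (weighted / total if total else 0.0)

# Invented outcomes: (prediction_correct, reason_correct) per test case.
counts, value = score([(True, True), (True, False), (False, False), (True, True)])
print(dict(counts), value)
```

Publishing the weights alongside the per-category counts would let readers recompute the score under a different weighting, which is the reproducibility concern raised in the first minor comment.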

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and have revised the manuscript to provide additional evidence and clarifications without overstating the current results.

  1. Referee: The abstract claims that QuiLL enables assessment of readiness for deployment in actual code production pipelines, yet no quantitative results, ablation studies, or details on the scoring metric's validation are provided, making it impossible to evaluate whether the central claims are supported by evidence.

    Authors: We agree that the original manuscript would benefit from more explicit supporting evidence for the deployment-readiness claim. In the revised version we have added a new experimental section reporting quantitative benchmarking results across multiple LLMs, ablation studies isolating the contributions of prompt variants and the NVD vector store, and a dedicated validation subsection for the accuracy-plus-reasoning metric that includes comparison against expert human judgments. These additions directly address the concern and allow readers to assess the strength of the central claims. revision: yes

  2. Referee: The integration of the NVD vector store for dynamic ICL and the novel scoring metric are described at a high level, but the manuscript does not report experiments applying the full pipeline to production code or undisclosed vulnerabilities, relying instead on NVD-sourced or curated snippets. This leaves the generalizability to noisy, real-world settings untested and is a load-bearing gap for the real-world readiness claim.

    Authors: We acknowledge that NVD-derived snippets, while drawn from real disclosed vulnerabilities, do not fully replicate the noise and scale of proprietary production codebases. We have therefore expanded the manuscript with additional experiments that inject realistic noise (e.g., incomplete functions, mixed-language files) into the test set and have clarified the scope of the current evaluation. We maintain that the NVD vector store provides a reproducible, real-world foundation for dynamic ICL, but we have tempered the language around immediate deployment readiness to reflect the proof-of-concept nature of the reported results. revision: partial

standing simulated objections not resolved
  • Direct evaluation on undisclosed vulnerabilities is inherently impossible because such vulnerabilities are not publicly available for testing.

Circularity Check

0 steps flagged

No circularity: framework assembles external components without self-referential reduction


The manuscript presents QuiLL as an assembled pipeline of standard LLM techniques (diverse prompts, NVD-derived vector store for ICL, and a proposed scoring metric) applied to public vulnerability data. No derivation step reduces a claimed result or prediction to a fitted parameter or self-citation chain within the paper; the central contribution is the integration itself rather than a closed mathematical or empirical loop that presupposes its outputs. All supporting elements draw from external, independently verifiable sources.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the assumption that LLMs can be made effective for vulnerability detection through prompting and retrieval, and introduces a novel scoring metric whose validity is not independently established in the abstract.

axioms (1)
  • domain assumption: LLMs can perform meaningful vulnerability detection and reasoning when given appropriate prompts and in-context examples from the National Vulnerability Database.
    This premise underpins the entire pipeline described in the abstract.
invented entities (1)
  • QuiLL scoring metric (no independent evidence)
    purpose: Quantifies both accuracy and reasoning quality of model predictions.
    A new metric introduced by the paper; no prior validation or independent evidence mentioned in the abstract.

pith-pipeline@v0.9.0 · 5689 in / 1413 out tokens · 30952 ms · 2026-05-18T10:53:16.726141+00:00 · methodology

