arxiv: 2606.04739 · v1 · pith:TYGPHL5Lnew · submitted 2026-06-03 · 💻 cs.SE · cs.AI

Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models

Pith reviewed 2026-06-28 05:25 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords vulnerability detection · RAG · LLM · reproducibility · open-weight models · pairwise accuracy · software security

The pith

Vul-RAG results reproduce locally, but performance plateaus at about 0.30 pairwise accuracy even with newer open-weight models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replicates Vul-RAG, a RAG-based system that augments LLMs with vulnerability knowledge to classify source code functions as vulnerable or patched. It verifies that the original findings hold when the system runs entirely on local open-weight models rather than proprietary APIs, although small differences appear. Across code-specialized, general-purpose, and reasoning models of different sizes, pairwise accuracy (the share of code pairs in which both the vulnerable and the patched version are classified correctly) stays near 0.30 and does not rise with more recent or larger models. This indicates that gains in raw model capacity alone do not overcome the limits of the current RAG setup for this task.

Core claim

The study shows that Vul-RAG findings are reproducible under local open-weights deployment with only minor deviations from the original, yet a performance plateau at approximately 0.30 pairwise accuracy holds across every evaluated model, including recent advanced ones, which demonstrates that improvements in model capacity by themselves do not substantially raise detection performance.

What carries the argument

The pairwise accuracy metric on code pairs (correct classification of both the vulnerable function and its patched counterpart) that exposes the persistent performance ceiling.
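
As a minimal sketch of how this metric could be computed, assume each function receives one boolean prediction, where True means "flagged as vulnerable". The `PairResult` type, its field names, and the toy data are illustrative assumptions and are not taken from the paper's artifacts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Predictions for one (vulnerable, patched) function pair.

    Each field holds the model's verdict: True means "predicted vulnerable".
    """
    vulnerable_pred: bool  # prediction for the truly vulnerable function
    patched_pred: bool     # prediction for its patched counterpart


def pairwise_accuracy(results: list[PairResult]) -> float:
    """Fraction of pairs where both functions are classified correctly."""
    if not results:
        raise ValueError("need at least one pair")
    # A pair counts only if the vulnerable version is flagged
    # and the patched version is not.
    correct = sum(1 for r in results if r.vulnerable_pred and not r.patched_pred)
    return correct / len(results)


# Toy data (hypothetical): only the first pair is fully correct.
toy = [
    PairResult(vulnerable_pred=True, patched_pred=False),   # both right
    PairResult(vulnerable_pred=True, patched_pred=True),    # flags the patch too
    PairResult(vulnerable_pred=False, patched_pred=False),  # misses the bug
    PairResult(vulnerable_pred=False, patched_pred=True),   # both wrong
]
print(pairwise_accuracy(toy))  # 0.25
```

Under this definition a detector that always answers "vulnerable" (or always "patched") scores 0, which is why the metric is stricter than per-function accuracy.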

If this is right

  • The original Vul-RAG results generalize to fully local open-weight deployments.
  • Neither model size nor recency breaks the 0.30 pairwise accuracy limit.
  • Detection effectiveness, model capabilities, and model scale involve concrete trade-offs that practitioners must weigh.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The plateau suggests that further progress may require changes to the retrieval component or the injected vulnerability knowledge rather than larger base models.
  • Open-weight models appear sufficient for reproducing this class of RAG vulnerability detector without reliance on closed APIs.
  • The same evaluation setup could be used to test whether fine-tuning or different knowledge sources can lift performance above the current ceiling.

Load-bearing premise

The chosen collection of open-weight models adequately represents the broader space of models, so that the observed plateau is not an artifact of those specific selections.

What would settle it

An open-weight model achieving clearly higher pairwise accuracy, for example above 0.45, on the same dataset and task would contradict the plateau claim.

Figures

Figures reproduced from arXiv: 2606.04739 by Fabian Schmidt, Sabrina Kaniewski, Tobias Heer.

Figure 1. Overview of Vul-RAG, adapted from [5]. The framework consists of three stages: (1) offline vulnerability knowledge base construction; (2) online retrieval of relevant knowledge items based on functional semantics; and (3) iterative, knowledge-augmented vulnerability detection. The markers *P1 and *P2 highlight points of failure where model performance frequently failed, as discussed in Sections 4.2 and 5.1.
Figure 2. Distribution of tokens for the detection prompts across the PairVul CWE subsets, calculated using the DeepSeek-Coder-V2-Instruct tokenizer. Red crosses indicate dataset items where the model failed to generate a detection output. The right side shows the distribution of input code lengths of PairVul in tokens.
Figure 3. Comparison of output lengths in tokens between baseline and reasoning models for max_new_tokens=4096. Reasoning models exhibit higher token counts and longer tails, reflecting the generation of reasoning steps.
[Plot text extracted after this caption: panels CWE-264, CWE-119, CWE-125, and CWE-200, each with iterations 1 to 3 on the x-axis and a 0 to 100% scale; the extraction is truncated.]
Figure 4. Conditional success rates for reasoning vs. baseline model. Values represent the proportion of samples classified at a specific iteration, provided no decision was reached earlier. Vul-RAG iterates through up to three items before enforcing a final decision.
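
One way to read the conditional rates described in the Figure 4 caption: at iteration k, divide the number of samples decided at k by the number still undecided before k. The sketch below follows that reading and assumes a hypothetical log that records, per sample, the 1-based iteration at which a verdict was reached; the function name and log format are illustrative, not the authors' implementation.

```python
from collections import Counter

MAX_ITERATIONS = 3  # per the caption: up to three knowledge items


def conditional_decision_rates(decided_at: list[int]) -> dict[int, float]:
    """Map each iteration k to P(decided at k | undecided before k).

    `decided_at` holds, per sample, the 1-based iteration at which the
    detector committed to a verdict (hypothetical log format).
    """
    if any(not 1 <= k <= MAX_ITERATIONS for k in decided_at):
        raise ValueError("iteration out of range")
    counts = Counter(decided_at)
    remaining = len(decided_at)  # samples still undecided before iteration k
    rates: dict[int, float] = {}
    for k in range(1, MAX_ITERATIONS + 1):
        if remaining == 0:
            break  # nothing reached this iteration, so the rate is undefined
        rates[k] = counts[k] / remaining
        remaining -= counts[k]
    return rates


# Toy log (hypothetical): 10 samples, 6 decided at iteration 1, 2 at 2, 2 at 3.
toy_log = [1] * 6 + [2] * 2 + [3] * 2
print(conditional_decision_rates(toy_log))  # {1: 0.6, 2: 0.5, 3: 1.0}
```

Because a final decision is enforced at the last iteration, the rate for iteration 3 is 1.0 whenever any sample reaches it.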
Original abstract

Large language models (LLMs) have shown strong potential for automated software vulnerability detection, particularly in retrieval-augmented generation (RAG) settings. However, for approaches relying on proprietary models and APIs, reproducibility and replicability remain largely unexplored, raising the question of whether reported results generalize or depend primarily on specific model choices. In this work, we present a reproducibility study of Vul-RAG, a RAG-based framework for source code vulnerability detection that enhances LLMs with high-level vulnerability knowledge. We first replicate the results in a fully local and open-weights setting using the reported open-weight baseline models. We then extend the evaluation to a diverse set of recent open-weight LLMs, including code-specialized, general-purpose, and reasoning models of varying parameter sizes. The results confirm that the findings of Vul-RAG are reproducible under local deployment, but with minor deviations. Across all evaluated models, we observe a performance plateau at approximately 0.30 pairwise accuracy (code pairs for which both the vulnerable and the patched function are correctly classified). Notably, this plateau persists even for more recent and advanced models, indicating that improvements in model capacity alone do not substantially enhance performance. Finally, we discuss practical implications and trade-offs between detection effectiveness, model capabilities, and model scale. Implementation and evaluation artifacts are publicly available at https://github.com/hs-esslingen-it-security/revisiting-Vul-RAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a reproducibility study of Vul-RAG, a RAG-based framework that augments LLMs with high-level vulnerability knowledge for source-code vulnerability detection. The authors first replicate the original results using the reported open-weight baseline models in a fully local setting, then extend the evaluation to a broader cohort of recent open-weight LLMs (code-specialized, general-purpose, and reasoning models of varying sizes). They report that the original findings are reproducible with only minor deviations and that a performance plateau of approximately 0.30 pairwise accuracy (correct classification of both vulnerable and patched functions in a pair) persists across all tested models, including newer and larger ones, leading to the conclusion that improvements in model capacity alone do not substantially enhance performance. Implementation artifacts are released publicly.

Significance. If the reported plateau is robust, the work has clear significance for the software-security and LLM-for-code communities: it supplies concrete evidence that simply scaling open-weight models is unlikely to overcome current limits in RAG-based vulnerability detection and therefore directs attention toward other levers such as retrieval quality, knowledge representation, or task-specific adaptation. The public release of code and evaluation artifacts is a concrete strength that directly supports replicability claims in the field.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Evaluation): the central claim that the ~0.30 pairwise-accuracy plateau 'persists even for more recent and advanced models' and therefore that 'improvements in model capacity alone do not substantially enhance performance' is load-bearing for the paper's main conclusion, yet the manuscript provides no enumeration of the concrete models, their parameter counts, training regimes, or a scaling plot. Without this information it is impossible to verify that capacity was varied over a range large enough for an effect to appear if one existed, undermining the representativeness argument.
  2. [§4.2] §4.2 (Model selection): the description of the cohort as 'code-specialized, general-purpose, and reasoning models of varying parameter sizes' is too coarse to rule out systematic bias toward decoder-only transformers trained on overlapping code corpora; a table listing every model, size, source, and any fine-tuning details is required to support the generalization.
minor comments (2)
  1. [Results] The definition of pairwise accuracy is given in the abstract but should be restated verbatim in the results section and any tables that report it, to avoid ambiguity for readers who start with the figures.
  2. [Artifacts] The GitHub link is provided; confirm that the released artifacts include the exact dataset splits, prompts, and statistical-test scripts used for the replication so that future replicators can match the reported minor deviations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and will revise the manuscript accordingly to improve clarity and support for our claims.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation): the central claim that the ~0.30 pairwise-accuracy plateau 'persists even for more recent and advanced models' and therefore that 'improvements in model capacity alone do not substantially enhance performance' is load-bearing for the paper's main conclusion, yet the manuscript provides no enumeration of the concrete models, their parameter counts, training regimes, or a scaling plot. Without this information it is impossible to verify that capacity was varied over a range large enough for an effect to appear if one existed, undermining the representativeness argument.

    Authors: We agree that a detailed listing of models and a visualization of performance across scales would strengthen the manuscript. In the revised version, we will add a table in §4.2 enumerating all evaluated models, including their parameter counts, architectures, sources, and training/fine-tuning details where publicly available. We will also include a scaling plot in §4 showing pairwise accuracy as a function of model size (in parameters) to demonstrate the observed plateau across the tested range. This addresses the concern about the representativeness of the capacity variation. revision: yes

  2. Referee: [§4.2] §4.2 (Model selection): the description of the cohort as 'code-specialized, general-purpose, and reasoning models of varying parameter sizes' is too coarse to rule out systematic bias toward decoder-only transformers trained on overlapping code corpora; a table listing every model, size, source, and any fine-tuning details is required to support the generalization.

    Authors: We acknowledge that the current high-level categorization may not sufficiently address potential biases. We will revise §4.2 to include a detailed table with every model, its size (parameter count), source (e.g., specific Hugging Face model identifiers), and any known fine-tuning or specialization information. This will enable readers to evaluate the diversity of the cohort and assess the generalizability of the performance plateau finding. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical replication of existing results with no derivations or self-referential quantities

full rationale

The paper is a reproducibility study that executes Vul-RAG on open-weight models, reports pairwise accuracy numbers, and observes an empirical plateau at ~0.30. No equations, fitted parameters, predictions derived from internal definitions, or self-citation chains appear in the abstract or described methodology. All reported quantities are direct experimental outputs from running models on the same benchmark; none reduce to quantities defined inside the paper itself. The generalization concern about model representativeness is a question of external validity, not circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical replication study. No free parameters are introduced. The central claim rests on the domain assumption that a local open-weights replication can meaningfully test the generalizability of the original proprietary-model results.

axioms (1)
  • domain assumption: A local open-weights replication using the reported baseline models faithfully captures the core claims of the original Vul-RAG framework.
    The study treats successful reproduction under local conditions as evidence that the original findings are not artifacts of proprietary APIs.

pith-pipeline@v0.9.1-grok · 5793 in / 1449 out tokens · 38734 ms · 2026-06-28T05:25:03.638374+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references · 9 canonical work pages

  1. [1] Angermeir, F., Amougou, M., Kreitz, M., Bauer, A., Linhuber, M., Fucci, D., et al.: Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies. arXiv:2510.25506 (2025)
  2. [2] Antal, G., Értekes, L., Szolnoki, N., Hegedűs, P.: Evaluating Retrieval-Augmented Generation for LLM-Based Vulnerability Detection: An Empirical Study on Real-World Java Vulnerabilities. IEEE Access 14, 50878–50891 (2026). https://doi.org/10.1109/ACCESS.2026.3676577
  3. [3] Chakraborty, P., Arumugam, K.K., Alfadel, M., Nagappan, M., McIntosh, S.: Revisiting the performance of deep learning-based vulnerability detection on realistic datasets. IEEE Transactions on Software Engineering 50(8), 2163–2177 (2024). https://doi.org/10.1109/TSE.2024.3423712
  4. [4] DeepSeek-AI, Zhu, Q., Guo, D., Shao, Z., Yang, D., Wang, P., et al.: DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence. arXiv:2406.11931 (2024)
  5. [5] Du, X., Zheng, G., Wang, K., Zou, Y., Wang, Y., Deng, W., et al.: Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG. ACM Transactions on Software Engineering and Methodology (TOSEM) (2026). https://doi.org/10.1145/3797277, Just Accepted
  6. [6] Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., et al.: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645(8081), 633–638 (2025)
  7. [7] Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., et al.: Qwen2.5-Coder Technical Report. arXiv:2409.12186 (2024)
  8. [8] Kaniewski, S., Holstein, D., Schmidt, F., Heer, T.: Vulnerability Handling of AI-Generated Code - Existing Solutions and Open Challenges. In: Conference on AI, Science, Engineering, and Technology (AIxSET). pp. 145–148. Laguna Hills, CA, USA (2024). https://doi.org/10.1109/AIxSET62544.2024.00026
  9. [9] Kaniewski, S., Schmidt, F., Enzweiler, M., Menth, M., Heer, T.: A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models. ACM Transactions on Software Engineering and Methodology (TOSEM) (2026). https://doi.org/10.1145/3815425, Just Accepted
  10. [10] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al.: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 33, pp. 9459–9474 (2020)
  11. [11] Lu, G., Ju, X., Chen, X., Pei, W., Cai, Z.: GRACE: Empowering LLM-based Software Vulnerability Detection with Graph Structure and In-Context Learning. Journal of Systems and Software (JSS) 212, 112031 (2024)
  12. [12] MITRE: CWE-1000: Research Concepts. https://cwe.mitre.org/data/definitions/1000.html (2026), accessed 2026-03-06
  13. [13] Nong, Y., Sharma, R., Hamou-Lhadj, A., Luo, X., Cai, H.: Open Science in Software Engineering: A Study on Deep Learning-Based Vulnerability Detection. IEEE Transactions on Software Engineering 49(4), 1983–2005 (2023). https://doi.org/10.1109/TSE.2022.3207149
  14. [14] Qwen: Qwen2.5-LLM: Extending the boundary of LLMs. https://qwen.ai/blog?id=qwen2.5-llm, accessed 2026-03-04
  15. [15] Qwen: QwQ-32B: Embracing the Power of Reinforcement Learning. https://qwen.ai/blog?id=qwq-32b, accessed 2026-03-04
  16. [16] Safdar, R., Mateen, D., Ali, S.T., Hussain, W.: Real-VulLLM: An LLM Based Assessment Framework in the Wild. arXiv:2510.04056 (2025)
  17. [17] Sallou, J., Durieux, T., Panichella, A.: Breaking the Silence: the Threats of Using LLMs in Software Engineering. In: ACM/IEEE International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). pp. 102–106. Lisbon, Portugal (2024). https://doi.org/10.1145/3639476.3639764
  18. [18] Steenhoek, B., Rahman, M.M., Jiles, R., Le, W.: An Empirical Study of Deep Learning Models for Vulnerability Detection. In: IEEE/ACM International Conference on Software Engineering (ICSE). pp. 2237–2248. Melbourne, Australia (2023). https://doi.org/10.1109/ICSE48619.2023.00188
  19. [19] Sun, Y., Wu, D., Xue, Y., Liu, H., Ma, W., Zhang, L., et al.: LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs’ Vulnerability Reasoning. arXiv:2401.16185 (2025)
  20. [20] Tsai, C.N., Wang, X., Lee, C.H., Lin, C.S.: A Sequential Multi-Stage Approach for Code Vulnerability Detection via Confidence- and Collaboration-based Decision Making. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 21162–21168. Suzhou, China (2025). https://doi.org/10.18653/v1/2025.emnlp-main.1071
  21. [21] Tsai, C.N., Xie, J., Lai, C.M., Lin, C.S.: Leveraging Intra- and Inter-References in Vulnerability Detection using Multi-Agent Collaboration Based on LLMs. Cluster Computing 28(15), 1–12 (2025). https://doi.org/10.1007/s10586-025-05721-2
  22. [22] Wen, X.C., Wang, X., Chen, Y., Hu, R., Lo, D., Gao, C.: VulEval: Towards Repository-Level Evaluation of Software Vulnerability Detection. arXiv:2404.15596 (2024)
  23. [23] Wu, Z., Xu, J., Peng, Y., Chong, C.Y., Jia, X.: MulVul: Retrieval-augmented Multi-Agent Code Vulnerability Detection via Cross-Model Prompt Evolution. arXiv:2601.18847 (2026)
  24. [24] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., et al.: Qwen3 Technical Report. arXiv:2505.09388 (2025)
  25. [25] Zhu, H., Li, J., Gao, C., Qian, J., Dong, Y., Liu, H., et al.: Specification-Guided Vulnerability Detection with Large Language Models. arXiv:2511.04014 (2025)