RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs

Achyuta Muthuvelan; Asini Subanya; Boubacar Ballo; Boyuan Chen; Eleanna Kafeza; Hakim Hacid; Kashish Satija; Mariam Shafey; Minghao Shao; Mohamed Mahmoud

arxiv: 2604.17948 · v2 · pith:OZCCORUAnew · submitted 2026-04-20 · 💻 cs.CR · cs.AI · cs.MA

Parteek Jamwal, Minghao Shao, Boyuan Chen, Achyuta Muthuvelan, Asini Subanya, Boubacar Ballo, Kashish Satija, Mariam Shafey, Mohamed Mahmoud, Moncif Dahaji Bouffi, Pasindu Wickramasinghe, Siyona Goel, Yaakulya Sabbani, Hakim Hacid, Mthandazo Ndhlovu, Eleanna Kafeza, Sanjay Rawat, Muhammad Shafique

Pith reviewed 2026-05-10 04:40 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.MA

keywords vulnerability report generation · LLM agents · retrieval augmented generation · automated security analysis · NIST-SARD dataset · CWE classification · cybersecurity documentation

The pith

RAVEN uses LLM agents and retrieval to generate structured vulnerability reports from code samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RAVEN as a multi-agent framework that combines large language models with retrieval-augmented generation to create detailed vulnerability analysis reports. It processes vulnerable source code by identifying issues, retrieving relevant knowledge from Project Zero reports and CWE entries, assessing impact, and producing outputs in a standardized template. Evaluation on 105 samples from the NIST-SARD dataset across 15 CWE types yields an average quality score of 54.21 percent according to a dedicated LLM judge. A sympathetic reader would care because manual creation of such reports is labor-intensive, and partial automation could accelerate security reviews during development and auditing. The work focuses on documentation quality rather than new detection methods.

Core claim

RAVEN deploys four coordinated modules—an Explorer agent for vulnerability identification, a RAG engine drawing from curated databases of Project Zero reports and CWE entries, an Analyst agent for impact and exploitation assessment, and a Reporter agent for structured output—followed by an LLM-based judge that scores reports on structural integrity, ground truth alignment, code reasoning quality, and remediation quality. Tested on 105 vulnerable code samples covering 15 CWE types, the system records an average quality score of 54.21 percent.

What carries the argument

The RAVEN multi-agent pipeline (Explorer, RAG engine, Analyst, Reporter) plus task-specific LLM Judge that evaluates generated reports against fixed criteria.
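The information flow between the four modules can be sketched as a chain of stages that each enrich a shared record. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Finding` type, the function names, the keyword check, the one-entry knowledge table, and the report fields are all hypothetical stand-ins for LLM calls and for the real Project Zero template, and CWE-121 is used purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Finding:
    """Hypothetical record handed from one stage to the next."""
    code: str
    suspected_cwe: str = ""
    retrieved_context: List[str] = field(default_factory=list)
    impact: str = ""
    report: str = ""


def explorer(finding: Finding) -> Finding:
    # Stand-in for the Explorer agent: a keyword check instead of an LLM call.
    finding.suspected_cwe = "CWE-121" if "strcpy" in finding.code else "unknown"
    return finding


def rag_engine(finding: Finding) -> Finding:
    # Stand-in for retrieval from curated Project Zero reports and CWE entries.
    knowledge = {"CWE-121": ["CWE-121: Stack-based Buffer Overflow"]}
    finding.retrieved_context = knowledge.get(finding.suspected_cwe, [])
    return finding


def analyst(finding: Finding) -> Finding:
    # Stand-in for impact and exploitation assessment.
    if finding.retrieved_context:
        finding.impact = "possible memory corruption"
    else:
        finding.impact = "impact not assessed"
    return finding


def reporter(finding: Finding) -> Finding:
    # Stand-in for structured output; these fields are NOT the real template.
    context = "; ".join(finding.retrieved_context) or "none"
    finding.report = "\n".join([
        f"Suspected weakness: {finding.suspected_cwe}",
        f"Retrieved context: {context}",
        f"Impact: {finding.impact}",
    ])
    return finding


# Stage order as described in the text: Explorer, RAG engine, Analyst, Reporter.
PIPELINE: List[Callable[[Finding], Finding]] = [explorer, rag_engine, analyst, reporter]


def run_pipeline(code: str) -> Finding:
    finding = Finding(code=code)
    for stage in PIPELINE:
        finding = stage(finding)
    return finding


if __name__ == "__main__":
    print(run_pipeline("char buf[8]; strcpy(buf, user_input);").report)
```

Keeping each stage as a plain function over one record makes it straightforward to drop a stage (for example, the RAG engine) and compare outputs, which is the kind of ablation the referee report asks for.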

If this is right

The framework can produce reports that follow an established Project Zero template for consistency across samples.
It covers a range of 15 CWE types drawn from a public NIST dataset of vulnerable code.
Quality evaluation breaks down into four measurable dimensions that can be tracked separately.
The approach separates identification, analysis, and documentation steps into distinct agents.
Results provide quantitative support for using retrieval to ground LLM outputs in known vulnerability knowledge.
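Tracking the four dimensions separately and then combining them might look like the following. This is a minimal sketch that assumes each dimension is scored from 0 to 100 and that the overall score is an unweighted mean; the summary above does not state the actual weighting, and the dimension keys and sample numbers are invented for illustration.

```python
from statistics import mean
from typing import Dict, List

# The four judge dimensions named in the text; the key spellings are assumptions.
DIMENSIONS = (
    "structural_integrity",
    "ground_truth_alignment",
    "code_reasoning_quality",
    "remediation_quality",
)


def overall_score(scores: Dict[str, float]) -> float:
    """Unweighted mean over the four dimensions (assumed aggregation rule)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return mean(scores[d] for d in DIMENSIONS)


def per_dimension_means(all_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each dimension across reports so it can be tracked on its own."""
    return {d: mean(s[d] for s in all_scores) for d in DIMENSIONS}


if __name__ == "__main__":
    # Made-up scores for two reports, not results from the paper.
    reports = [
        {"structural_integrity": 80.0, "ground_truth_alignment": 40.0,
         "code_reasoning_quality": 60.0, "remediation_quality": 20.0},
        {"structural_integrity": 90.0, "ground_truth_alignment": 50.0,
         "code_reasoning_quality": 50.0, "remediation_quality": 30.0},
    ]
    print([overall_score(r) for r in reports])
    print(per_dimension_means(reports))
```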

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent-plus-retrieval structure could be tested on binary executables to address the memory-corruption focus in the title.
Replacing or supplementing the LLM judge with human raters would provide a direct check on whether the 54 percent score reflects actual utility.
Integration of RAVEN outputs into existing bug-tracking systems could reduce the time from discovery to documented report.
Extending the RAG database beyond Project Zero and CWE entries might improve scores on less common vulnerability classes.
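To make the last extension concrete, the retrieval step can be pictured as a similarity search over a corpus that grows as new sources are added. The sketch below is only an illustration under loud assumptions: it uses a toy bag-of-words vector and cosine similarity in place of whatever embedding model and vector database RAVEN actually uses, and the document IDs and texts are invented.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a learned model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: Dict[str, str], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    q = embed(query)
    scored = [(doc_id, cosine(q, embed(text))) for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Invented entries standing in for CWE and Project Zero material.
    corpus = {
        "cwe-entry-a": "stack buffer overflow when copying input",
        "pz-report-b": "use after free in object lifetime handling",
    }
    # Extending the database amounts to adding documents from a new source.
    corpus["extra-source-c"] = "integer overflow in length calculation"
    print(retrieve("buffer overflow copying", corpus, k=1))
```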

Load-bearing premise

An LLM judge can accurately score report quality on structure, alignment, reasoning, and remediation without human validation or comparison baselines.

What would settle it

Human security experts independently rating the same 105 reports and obtaining average scores materially below 54.21 percent.

Figures

Figures reproduced from arXiv: 2604.17948 by Achyuta Muthuvelan, Asini Subanya, Boubacar Ballo, Boyuan Chen, Eleanna Kafeza, Hakim Hacid, Kashish Satija, Mariam Shafey, Minghao Shao, Mohamed Mahmoud, Moncif Dahaji Bouffi, Mthandazo Ndhlovu, Muhammad Shafique, Parteek Jamwal, Pasindu Wickramasinghe, Sanjay Rawat, Siyona Goel, Yaakulya Sabbani.

**Figure 1.** Architectural Overview of RAVEN: (1) a Data Collection Pipeline that transforms raw web pages and PDFs into a structured format; (2) a RAG Engine that indexes the structured data into a vector database for retrieval; (3) an Agentic System that takes in the vulnerable code snippet and generates a comprehensive, Google Project Zero-style vulnerability report.

**Figure 3.** Overview of RAVEN’s RAG Engine. It comprises …

**Figure 4.** The RAVEN Agentic Pipeline. It orchestrates four specialized agents for end-to-end vulnerability analysis: (1) the …

**Figure 5.** Box Plot Analysis of Overall Scores for all Falcon Models

**Figure 6.** Falcon H1R-7B Individual Dimension Scores across all RAG Configurations (A …

**Figure 7.** CWE Remediation Fix Analysis Plot

Upon examining the Judge Agent logs for Falcon H1R-7B and Falcon H1-34B-Instruct, we can observe that:

• Falcon H1R-7B: This model consistently generates syntactically correct fixes; however, these fixes often prevent program failure by disabling or bypassing the problematic behavior rather than directly addressing and correcting the underlying cause.
• Falcon H1-34B-Instr…

Original abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification, detection, and patching. However, their potential in automated vulnerability report documentation and analysis remains underexplored. We present RAVEN (Retrieval Augmented Vulnerability Exploration Network), a framework leveraging LLM agents and Retrieval Augmented Generation (RAG) to synthesize comprehensive vulnerability analysis reports. Given vulnerable source code, RAVEN generates reports following the Google Project Zero Root Cause Analysis template. The framework uses four modules: an Explorer agent for vulnerability identification, a RAG engine retrieving relevant knowledge from curated databases including Google Project Zero reports and CWE entries, an Analyst agent for impact and exploitation assessment, and a Reporter agent for structured report generation. To ensure quality, RAVEN includes a task specific LLM Judge evaluating reports across structural integrity, ground truth alignment, code reasoning quality, and remediation quality. We evaluate RAVEN on 105 vulnerable code samples covering 15 CWE types from the NIST-SARD dataset. Results show an average quality score of 54.21%, supporting the effectiveness of our approach for automated vulnerability documentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RAVEN sets up four LLM agents plus RAG to produce Project Zero-style reports on vulnerable code, but the 54% LLM-judge score is hard to read without baselines or human checks.

The paper builds a pipeline with an Explorer agent to spot issues in code, a RAG module that pulls from Project Zero reports and CWE entries, an Analyst to judge impact and exploitability, and a Reporter to format the final output. They run it on 105 NIST-SARD samples across 15 CWE types and report an average 54.21% quality score from their own LLM judge on structure, ground-truth match, reasoning, and remediation suggestions.

That specific four-agent setup for full report synthesis is the clearest new piece; most prior LLM security work stops at detection or patching rather than end-to-end documentation in this template. The modular breakdown is practical and matches how a human analyst might work through a case. The choice of real datasets and external knowledge sources is also a plus for grounding.

The evaluation is the soft spot. A single LLM judge score without any baseline (plain zero-shot generation, for example), without ablations on the RAG or agent components, and without even a small human re-scoring leaves the number floating. We cannot tell whether 54% is good, mediocre, or better than simpler methods, and any bias in the judge directly affects the headline result. No error analysis or inter-rater numbers are described either.

This work is aimed at people already experimenting with LLM agents for security tooling. A reader who wants concrete examples of how to chain retrieval and agents for report generation could pick up useful design patterns, even if the performance claims need more support.

It deserves peer review. The system is implemented and tested on a public dataset, so referees can ask for the missing comparisons and human validation rather than rejecting outright.

Referee Report

2 major / 2 minor

Summary. The paper introduces RAVEN, a multi-agent LLM framework with four modules (an Explorer agent, a RAG engine drawing from Project Zero and CWE databases, an Analyst agent, and a Reporter agent) that takes vulnerable source code and produces structured vulnerability reports following the Google Project Zero Root Cause Analysis template. It evaluates the system on 105 NIST-SARD samples spanning 15 CWE types using a task-specific LLM judge that scores reports on structural integrity, ground truth alignment, code reasoning, and remediation, reporting an average quality score of 54.21%.

Significance. If the evaluation methodology were strengthened with human validation and baselines, the work could provide a useful template for automated vulnerability documentation pipelines. As presented, the headline metric does not yet constitute strong evidence of effectiveness because it lacks calibration against human experts or simpler non-RAG baselines.

major comments (2)

[Abstract] Abstract and evaluation description: the effectiveness claim is supported solely by a 54.21% average LLM-judge score on 105 samples, yet the manuscript provides no baselines (e.g., zero-shot LLM generation without the Explorer/Analyst/RAG components), no statistical significance tests, no inter-rater agreement for the judge, and no human re-scoring of any subset. This renders the numerical result uninterpretable as evidence that RAVEN improves report quality.
[Evaluation] Evaluation protocol (LLM Judge section): the judge itself is an LLM from the same model family as the generation agents and is never calibrated against human experts or compared to ground-truth human-written reports. Any systematic bias in the judge directly affects the reported 54.21% score, and the paper contains no ablation showing that the RAG or multi-agent modules raise judge scores above a plain LLM baseline.

minor comments (2)

[Title] Title vs. content mismatch: the title emphasizes 'Memory Corruption Analysis in User Code and Binary Programs', yet the abstract, dataset (NIST-SARD source-code samples), and reported experiments address general vulnerability report generation without reference to binary analysis or memory-corruption-specific techniques.
[Framework Description] Notation and module descriptions: the four-module architecture is described at a high level; a diagram or pseudocode showing the exact information flow between Explorer, RAG engine, Analyst, and Reporter would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the current evaluation lacks baselines and human calibration, which limits interpretability of the 54.21% score. Below we respond point-by-point and describe the revisions we will make to strengthen the evidence.

Referee: [Abstract] Abstract and evaluation description: the effectiveness claim is supported solely by a 54.21% average LLM-judge score on 105 samples, yet the manuscript provides no baselines (e.g., zero-shot LLM generation without the Explorer/Analyst/RAG components), no statistical significance tests, no inter-rater agreement for the judge, and no human re-scoring of any subset. This renders the numerical result uninterpretable as evidence that RAVEN improves report quality.

Authors: We acknowledge this limitation. The reported score reflects the LLM judge's assessment against the defined criteria (structural integrity, ground-truth alignment, code reasoning, and remediation), but without baselines it is difficult to attribute gains to the multi-agent RAG design. In the revised manuscript we will add a zero-shot LLM baseline (same model family, no Explorer/Analyst/RAG) and report the delta in judge scores. We will also include statistical significance testing (e.g., paired t-tests or Wilcoxon tests) on the per-sample scores. For human validation we will re-score a random subset of 20 reports with two independent human experts and compute inter-rater agreement (Cohen's kappa) between humans and between humans and the LLM judge; we will report these results and any discrepancies. The abstract will be updated to state that the 54.21% figure is an LLM-judge score and to note the new baseline comparison.

revision: yes

Referee: [Evaluation] Evaluation protocol (LLM Judge section): the judge itself is an LLM from the same model family as the generation agents and is never calibrated against human experts or compared to ground-truth human-written reports. Any systematic bias in the judge directly affects the reported 54.21% score, and the paper contains no ablation showing that the RAG or multi-agent modules raise judge scores above a plain LLM baseline.

Authors: We agree that using an LLM judge from the same family introduces potential bias and that an ablation is necessary. We will add an explicit ablation section comparing (1) plain zero-shot generation, (2) single-agent generation without RAG, and (3) full RAVEN, all evaluated by the same judge. To address calibration we will conduct a human study on the 20-report subset mentioned above, comparing human-assigned quality scores to the LLM-judge scores and reporting correlation and mean absolute difference. While we cannot retroactively change the judge model family without re-running all experiments, we will discuss the bias risk explicitly and note that the structured rubric (with explicit criteria for each dimension) was designed to reduce subjectivity. These additions will be included in the revised evaluation section.

revision: yes
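The agreement statistics promised in the rebuttal are simple to compute once scores are collected. The sketch below is illustrative only: it assumes quality scores have been binned into categorical labels before computing Cohen's kappa (the rebuttal does not say how scores would be discretized), and all labels and numbers are invented rather than taken from the paper.

```python
from collections import Counter
from statistics import mean
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_absolute_difference(x: Sequence[float], y: Sequence[float]) -> float:
    """Average absolute gap between paired scores, e.g. human vs. LLM judge."""
    if len(x) != len(y) or not x:
        raise ValueError("score lists must be the same non-empty length")
    return mean(abs(p - q) for p, q in zip(x, y))


if __name__ == "__main__":
    # Invented labels and scores for illustration only.
    human = ["good", "good", "poor", "poor"]
    judge = ["good", "poor", "poor", "poor"]
    print(cohens_kappa(human, judge))
    print(mean_absolute_difference([60.0, 50.0, 40.0], [50.0, 50.0, 70.0]))
```

A kappa near zero would indicate agreement no better than chance, which bears directly on whether the LLM judge's 54.21% can be read as a proxy for expert opinion.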

Circularity Check

0 steps flagged

No circularity: empirical framework with external evaluation dataset and no derivations or self-referential reductions

The paper presents RAVEN as an LLM-agent + RAG framework for generating vulnerability reports following a Google Project Zero template, evaluated on 105 NIST-SARD samples across 15 CWE types. The reported 54.21% average quality score comes from a separate task-specific LLM judge assessing structural integrity, ground truth alignment, code reasoning, and remediation. No equations, parameters, or derivation chains exist that could reduce outputs to inputs by construction. The evaluation uses an external benchmark dataset and does not rely on self-citations, fitted predictions, or uniqueness theorems imported from prior author work. The judge is described as an independent quality module rather than a tautological re-use of the generator's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that LLMs can reliably perform vulnerability identification and analysis when augmented with retrieved external knowledge; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Large language models can perform vulnerability identification, impact assessment, and structured report generation when augmented with retrieved context from curated databases. This assumption is invoked to justify the Explorer, Analyst, and Reporter agents' capabilities.

pith-pipeline@v0.9.0 · 5592 in / 1317 out tokens · 44832 ms · 2026-05-10T04:40:15.717883+00:00 · methodology
