Evaluating Retrieval-Augmented Generation for Explainable Malware Analysis
Pith reviewed 2026-05-08 18:17 UTC · model grok-4.3
The pith
RAG frequently degrades the quality of LLM explanations for malware analysis
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across multiple LLMs, RAG frequently degrades explanation quality by introducing distracting or weakly related context and adding narrative noise or generic write-ups when using VirusTotal reports as input. The results highlight a practical risk in security-critical pipelines where RAG can be counterproductive when structured security evidence is already sufficient. Malware explanation is primarily a signal-extraction task, not a knowledge-retrieval problem.
What carries the argument
Comparative evaluation of RAG-augmented versus standard LLM generations for malware explanations, with quality assessed via automated metrics and human judgments on VirusTotal report inputs.
If this is right
- RAG can be counterproductive in malware explanation pipelines and should be used with caution or avoided when input data is structured.
- Security workflows benefit from focusing on direct signal extraction from provided evidence rather than external retrieval.
- LLM-based tools for analysts require domain-specific testing to determine whether augmentation improves or harms output.
- Design of explainable malware analysis systems should prioritize simplicity when evidence is rich.
Where Pith is reading between the lines
- Similar degradation could occur in other analysis tasks with structured inputs like log files or forensic reports.
- Conditional retrieval mechanisms might be developed to activate RAG only when gaps in the input are identified.
- This suggests that fine-tuning models on domain data could replace the need for RAG in security explanations.
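The conditional-retrieval idea above can be made concrete as a gate that triggers retrieval only when the structured report is missing key fields. This is a minimal sketch, not from the paper; the field names and coverage threshold are illustrative assumptions.

```python
# Hypothetical gate: retrieve external knowledge only when the structured
# report lacks enough signal on its own. Field names and the 0.75 coverage
# threshold are illustrative, not from the paper.
REQUIRED_FIELDS = ["detections", "behavior", "network_iocs", "file_metadata"]

def should_retrieve(report: dict, min_coverage: float = 0.75) -> bool:
    """Trigger RAG only when the report covers too few required fields."""
    present = sum(1 for field in REQUIRED_FIELDS if report.get(field))
    return present / len(REQUIRED_FIELDS) < min_coverage

def explain(report: dict, llm, retriever=None) -> str:
    """Signal extraction first; retrieval only to fill identified gaps."""
    context = ""
    if retriever is not None and should_retrieve(report):
        context = retriever(report)  # fetch external context for the gaps
    return llm(report, context)
```

A gate like this preserves the paper's default (no retrieval when evidence is rich) while leaving RAG available for sparse reports.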
Load-bearing premise
The chosen metrics and human judgments for explanation quality accurately capture usefulness to malware analysts, and VirusTotal reports represent sufficiently rich structured input to which retrieval adds no value.
What would settle it
A study where malware analysts are given pairs of explanations and rate which is more useful for their work, with consistent preference against the RAG versions.
Figures
Original abstract
Large Language Models (LLMs) are increasingly being used as security engineering tools to summarize and explain malware behavior to analysts. A common assumption is that Retrieval-Augmented Generation (RAG) improves explanation quality by injecting external security knowledge. In this work, we empirically evaluate this assumption for malware explanation using VirusTotal reports as structured input. Across multiple LLMs, we find that RAG frequently degrades explanation quality by introducing distracting or weakly related context and adding narrative noise or generic write-ups. Our results highlight a practical risk in security-critical pipelines for malware explanation that RAG can be counterproductive when structured security evidence is already sufficient. We argue that malware explanation is primarily a signal-extraction task, not a knowledge-retrieval problem, and outline design recommendations for secure development workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically evaluates Retrieval-Augmented Generation (RAG) versus non-RAG setups for LLM-based explanation of malware behavior, using VirusTotal reports as structured input. Across multiple LLMs, it reports that RAG frequently degrades explanation quality through distracting or weakly related context, narrative noise, and generic write-ups. The authors conclude that malware explanation is primarily a signal-extraction task rather than a knowledge-retrieval problem and provide design recommendations for security workflows.
Significance. If the central empirical finding holds under controlled conditions, the result would be significant for LLM applications in cybersecurity: it would demonstrate a domain where RAG can be counterproductive when the input already contains rich, structured evidence, thereby informing more efficient pipeline designs for malware analysis and reducing unnecessary retrieval overhead in security-critical tools.
major comments (2)
- [Abstract / §3] Abstract and §3 (or equivalent experimental setup section): the claim that 'RAG frequently degrades explanation quality' is presented without any reported sample size, number of malware instances, exact quality metrics (human or automated), inter-annotator agreement, or statistical tests. This absence makes it impossible to assess whether the observed degradation is statistically reliable or generalizable across LLMs.
- [§4] §4 (RAG pipeline description): no details are supplied on the retrieval corpus, embedding model, chunking strategy, top-k selection, or similarity threshold. The degradation result is therefore load-bearing on an uncharacterized implementation; if retrieval draws from a noisy general corpus rather than a curated malware-knowledge base, noise is expected by construction and does not demonstrate that RAG is inherently counterproductive for structured VirusTotal inputs.
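The knobs the referee asks for are the standard ones in any dense-retrieval step. A stdlib-only sketch with toy bag-of-words vectors (a stand-in for a real embedding model) shows where each undocumented parameter lives:

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline uses a trained model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str],
             top_k: int = 3, min_sim: float = 0.2) -> list[str]:
    """Top-k chunks above a similarity threshold -- exactly the parameters
    (corpus composition, chunking, top_k, min_sim) the paper should report."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda c: cosine(q, embed(c)), reverse=True)
    return [c for c in ranked[:top_k] if cosine(q, embed(c)) >= min_sim]
```

Each default here (`top_k=3`, `min_sim=0.2`) is arbitrary; the referee's point is that without the paper's actual values, "RAG degrades quality" cannot be separated from "this retriever is noisy".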
minor comments (2)
- [Evaluation] Clarify the precise definition of 'explanation quality' used in the evaluation (e.g., factual accuracy, analyst utility, conciseness) and whether any automated proxies were cross-validated against human judgments.
- [Abstract / Results] The abstract states results 'across multiple LLMs' but does not name the models or report per-model breakdowns; adding a table with per-LLM outcomes would improve transparency.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate additional experimental details and pipeline specifications.
Point-by-point responses
Referee: [Abstract / §3] Abstract and §3 (or equivalent experimental setup section): the claim that 'RAG frequently degrades explanation quality' is presented without any reported sample size, number of malware instances, exact quality metrics (human or automated), inter-annotator agreement, or statistical tests. This absence makes it impossible to assess whether the observed degradation is statistically reliable or generalizable across LLMs.
Authors: We acknowledge that the manuscript does not report these details explicitly in the abstract or §3. In the revised version we will expand the experimental setup section to state the number of malware instances evaluated, the precise quality metrics (human ratings on dimensions such as accuracy, relevance, and conciseness together with any automated metrics), inter-annotator agreement, and the statistical tests performed to establish the significance of the observed degradation. These additions will allow readers to evaluate reliability and generalizability directly.
Revision: yes
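Of the promised details, inter-annotator agreement is the easiest to pin down concretely: Cohen's kappa for two raters needs only the stdlib. The labels in the test are illustrative, not the paper's data.

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters over the same items: observed agreement
    corrected for the agreement expected by chance from each rater's margins."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[label] * cb[label] for label in set(ca) | set(cb)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```

Kappa of 1.0 means perfect agreement; 0.0 means agreement no better than chance, which would undercut any quality metric built on those ratings.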
Referee: [§4] §4 (RAG pipeline description): no details are supplied on the retrieval corpus, embedding model, chunking strategy, top-k selection, or similarity threshold. The degradation result is therefore load-bearing on an uncharacterized implementation; if retrieval draws from a noisy general corpus rather than a curated malware-knowledge base, noise is expected by construction and does not demonstrate that RAG is inherently counterproductive for structured VirusTotal inputs.
Authors: We agree that §4 lacks sufficient implementation detail. The retrieval corpus used in our experiments was assembled from curated security reports and malware-analysis documents rather than an open web corpus; however, we recognize that this must be stated explicitly. In the revision we will document the embedding model, chunking strategy, top-k selection, similarity threshold, and the exact composition of the corpus. These additions will clarify that the degradation occurs even with a domain-specific corpus when the primary input (VirusTotal reports) already supplies structured evidence, supporting our claim that malware explanation is primarily a signal-extraction task.
Revision: yes
Circularity Check
No circularity: direct empirical comparison without derivations or self-referential predictions
full rationale
The paper is a straightforward empirical evaluation of RAG versus non-RAG LLM outputs for malware explanation on VirusTotal reports. It reports observed degradation in explanation quality across multiple models but contains no derivation chain, equations, fitted parameters, or first-principles claims that could reduce to their own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to support the central result. The analysis rests on direct experimental comparisons and human judgments, which are externally falsifiable and independent of any internal redefinition or renaming.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: VirusTotal reports provide sufficient structured security evidence for high-quality malware explanations without external retrieval.