pith. machine review for the scientific record.

arxiv: 2604.11419 · v1 · submitted 2026-04-13 · 💻 cs.AI · cs.CR

Recognition: unknown

Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:13 UTC · model grok-4.3

classification 💻 cs.AI cs.CR
keywords cyber threat intelligence · retrieval-augmented generation · knowledge graphs · multi-hop reasoning · agentic retrieval · hybrid retrieval · systematic evaluation

The pith

Hybrid graph-text retrieval improves answer quality by up to 35 percent on multi-hop CTI questions compared to vector RAG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares four retrieval-augmented generation systems for answering complex questions over collections of cyber threat intelligence reports. Standard vector retrieval often fails when evidence spans multiple documents and requires following relationships between entities such as actors, malware, and vulnerabilities. The evaluation tests a pure graph approach, an agentic graph system that attempts to fix broken queries, and a hybrid that pairs graph queries with text retrieval. On a benchmark of 3,300 questions covering factual lookups, multi-hop relations, synthesis tasks, and unanswerable cases, the hybrid system shows the largest gains on relational questions while remaining more stable than graph-only methods.
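The baseline in this comparison is standard dense-vector RAG: report chunks and queries are embedded, the top-k most similar chunks are retrieved, and the LLM answers conditioned on them. A minimal sketch of the ranking step, using hand-written toy vectors in place of a learned encoder and an approximate nearest-neighbor index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(query_vec, chunk_index, k=3):
    """Rank chunk vectors by similarity to the query vector, return top-k ids."""
    scored = sorted(chunk_index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy index: chunk id -> embedding (real systems use a trained encoder + ANN search).
index = {
    "report_a#0": [0.9, 0.1, 0.0],
    "report_a#1": [0.2, 0.8, 0.1],
    "report_b#0": [0.1, 0.2, 0.9],
}
print(top_k_chunks([1.0, 0.0, 0.0], index, k=2))  # ['report_a#0', 'report_a#1']
```

The failure mode the paper targets is visible here: ranking is per-chunk, so evidence chains that span several low-scoring chunks in different reports never surface together.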

Core claim

We present a systematic evaluation of four RAG architectures for CTI analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic variant that repairs failed graph queries, and a hybrid approach combining graph queries with text retrieval. We evaluate these systems on 3,300 CTI question-answer pairs spanning factual lookups, multi-hop relational queries, analyst-style synthesis questions, and unanswerable cases. Results show that graph grounding improves performance on structured factual queries. The hybrid graph-text approach improves answer quality by up to 35 percent on multi-hop questions compared to vector RAG, while maintaining more reliable performance than graph-only systems.

What carries the argument

The hybrid graph-text retrieval architecture, which runs structured queries over an entity-relationship knowledge graph of threats and falls back to or augments with vector text retrieval from the original reports.
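A minimal sketch of that traverse-then-augment pattern, using an in-memory dict in place of the paper's Neo4j graph and hand-written chunks in place of the report corpus (the entity names and relations here are invented for illustration):

```python
# Toy CTI graph: (entity, relation) -> set of related entities.
GRAPH = {
    ("APT-X", "uses"): {"LockBolt"},
    ("LockBolt", "exploits"): {"CVE-2024-0001"},
}

# Text chunks from the source reports, keyed by the entity they describe.
CHUNKS = {
    "APT-X": "APT-X is a financially motivated actor active since 2021.",
    "LockBolt": "LockBolt ransomware spreads via phishing attachments.",
    "CVE-2024-0001": "CVE-2024-0001 is a remote code execution flaw.",
}

def graph_hops(start, relations):
    """Follow a chain of relations from a start entity; return all entities touched."""
    frontier, visited = {start}, [start]
    for rel in relations:
        frontier = set().union(*(GRAPH.get((e, rel), set()) for e in frontier))
        visited.extend(sorted(frontier))
    return visited

def hybrid_context(start, relations):
    """Structured traversal first, then text chunks for every entity on the path."""
    entities = graph_hops(start, relations)
    return entities, [CHUNKS[e] for e in entities if e in CHUNKS]

# "Which vulnerability is exploited by the malware APT-X uses?"
entities, context = hybrid_context("APT-X", ["uses", "exploits"])
```

The graph hop pins down the answer entity; the text chunks for every entity on the path give the LLM the narrative evidence that a flat top-k retrieval might scatter across reports.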

If this is right

  • Graph retrieval alone suffices for simple factual lookups over threat entities.
  • Hybrid systems reduce the error spikes seen in pure graph retrieval on ambiguous or incomplete queries.
  • Multi-hop relational questions benefit most from explicit relationship traversal plus text context.
  • Agentic query repair improves graph-only results but does not match the hybrid's consistency.
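The agentic variant's generate-validate-repair loop (capped at 25 iterations, per the paper's Figure 5) can be sketched as below; the stub generator and validator are stand-ins for the LLM and the Cypher syntax/schema/runtime checks, not the paper's implementation:

```python
def repair_loop(generate, validate, question, max_iters=25):
    """Generate a query, validate it, and feed errors back until one passes.

    generate(question, feedback) and validate(query) abstract the LLM and the
    query checks; both callables are assumptions of this sketch.
    """
    feedback = None
    for attempt in range(1, max_iters + 1):
        query = generate(question, feedback)
        ok, feedback = validate(query)
        if ok:
            return query, attempt
    return None, max_iters  # give up; caller falls back or reports failure

# Stub: the first draft misspells a label; the validator's message fixes it.
def stub_generate(question, feedback):
    return "MATCH (a:Actor) RETURN a" if feedback else "MATCH (a:Actr) RETURN a"

def stub_validate(query):
    if ":Actr" in query:
        return False, "unknown label 'Actr'; did you mean 'Actor'?"
    return True, None

query, attempts = repair_loop(stub_generate, stub_validate, "Which actors?")
# query == "MATCH (a:Actor) RETURN a", attempts == 2
```

The bullet above is consistent with this shape: repair fixes queries that are wrong for recoverable reasons, but cannot help when the graph simply lacks the needed relationship, which is where the hybrid's text fallback pays off.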

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Analyst tools could route queries by detected type, using graphs only when relationships dominate.
  • The same hybrid pattern may apply to other domains where evidence is scattered across narrative reports.
  • Larger or more densely connected threat graphs could widen the measured advantage on multi-hop tasks.
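The routing idea in the first bullet could be as simple as keyword dispatch; the cue lists below are invented for illustration and are not taken from the paper:

```python
def route_query(question):
    """Heuristic router: use graph-backed retrieval only when relationships dominate.

    The keyword cues are illustrative assumptions; a production router would
    likely use a classifier rather than substring matching.
    """
    q = question.lower()
    relational = ("which actor", "linked to", "uses", "exploits", "targets")
    synthesis = ("summarize", "assess", "compare", "overview")
    if any(cue in q for cue in relational):
        return "hybrid"   # relationship traversal plus text context
    if any(cue in q for cue in synthesis):
        return "vector"   # narrative synthesis favors raw text
    return "vector"       # default: cheap dense retrieval

print(route_query("Which actor uses LockBolt?"))  # hybrid
```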

Load-bearing premise

The 3,300 CTI question-answer pairs are representative of real analyst needs and the chosen metrics accurately measure answer quality without bias introduced by how the questions or knowledge graph were built.

What would settle it

A fresh collection of CTI questions written by practicing analysts, scored by the same quality metrics, shows no gain or a loss for the hybrid system over vector RAG on the multi-hop subset.

Figures

Figures reproduced from arXiv: 2604.11419 by Andreas Rauber, Dzenan Hamzic, Florian Skopik, Markus Wurzenberger, Max Landauer.

Figure 1. AI Retrieval in CTI Domain.
Figure 2. Data generation and evaluation workflow.
Figure 3. Overview of different RAG systems (pipeline nodes: Start, Guardrails, Generate Cypher, Validate Cypher, Correct Cypher, Execute Cypher, Generate Final Answer, End).
Figure 4. GRAG system configuration.
Figure 5. HybridRAG parallel pipeline. A read-only Cypher query is validated for syntactic correctness and schema conformity and executed against Neo4j; if validation fails, the error message and previous query are fed back to the LLM for repair, looping until successful execution or a maximum of 25 iterations.
Figure 6. Score distribution across RAG systems (LLM-as-a-Judge, 0–50 scale); each cell reports the percentage of evaluated questions.
Figure 7. Mean ranks and top-placement frequency by question category (10 runs, LLM-as-a-Judge). (a) Mean rank (1 = best); (b) sole…
Figure 8. GPT-5.2 performance metrics.
read the original abstract

Cyber threat intelligence (CTI) analysts must answer complex questions over large collections of narrative security reports. Retrieval-augmented generation (RAG) systems help language models access external knowledge, but traditional vector retrieval often struggles with queries that require reasoning over relationships between entities such as threat actors, malware, and vulnerabilities. This limitation arises because relevant evidence is often distributed across multiple text fragments and documents. Knowledge graphs address this challenge by enabling structured multi-hop reasoning through explicit representations of entities and relationships. However, multiple retrieval paradigms, including graph-based, agentic, and hybrid approaches, have emerged with different assumptions and failure modes. It remains unclear how these approaches compare in realistic CTI settings and when graph grounding improves performance. We present a systematic evaluation of four RAG architectures for CTI analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic variant that repairs failed graph queries, and a hybrid approach combining graph queries with text retrieval. We evaluate these systems on 3,300 CTI question-answer pairs spanning factual lookups, multi-hop relational queries, analyst-style synthesis questions, and unanswerable cases. Results show that graph grounding improves performance on structured factual queries. The hybrid graph-text approach improves answer quality by up to 35 percent on multi-hop questions compared to vector RAG, while maintaining more reliable performance than graph-only systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper conducts a systematic empirical evaluation of four RAG architectures for cyber threat intelligence (CTI) analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic graph-query repair variant, and a hybrid graph-text approach. It evaluates these on 3,300 CTI question-answer pairs covering factual lookups, multi-hop relational queries, analyst-style synthesis, and unanswerable cases, claiming that graph grounding helps structured factual queries and that the hybrid approach improves answer quality by up to 35% on multi-hop questions relative to vector RAG while being more reliable than graph-only systems.

Significance. If the evaluation setup is free of construction bias, the work supplies a useful head-to-head comparison of retrieval paradigms in a specialized, high-stakes domain. It offers concrete evidence on when graph augmentation helps multi-hop reasoning over narrative reports and identifies failure modes of pure graph and pure vector approaches, which could inform practical RAG design for relational domains.

major comments (1)
  1. [Abstract and Evaluation Setup] The central 35% improvement claim on multi-hop questions (abstract) rests on the 3,300 QA pairs being an unbiased test of retrieval architecture. The abstract and evaluation description provide no information on how these pairs were generated, their provenance, or whether question construction traversed the CTI knowledge graph or used prompts that explicitly surface entity-relation chains. If the test cases were derived from the graph, vector RAG is structurally disadvantaged because it receives only flat text chunks while the hybrid system receives the relational scaffolding used to create the questions. This is load-bearing for attributing the delta to retrieval method rather than test-set construction; an independent, graph-agnostic source (real analyst logs or expert-authored queries) is required to support the comparative claim.
minor comments (2)
  1. [Abstract] The abstract states quantitative results but supplies no information on the exact metrics used to measure 'answer quality,' whether statistical significance testing was performed, the question-generation process, or controls for dataset bias.
  2. [Methods] Details on how the CTI knowledge graph was constructed, its coverage, and how graph queries are formulated and repaired in the agentic variant are needed to allow replication and to assess generalizability beyond the specific CTI corpus.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that clear documentation of the benchmark construction is essential to support the comparative claims, and we will revise the manuscript accordingly to provide this transparency.

read point-by-point responses
  1. Referee: [Abstract and Evaluation Setup] The central 35% improvement claim on multi-hop questions (abstract) rests on the 3,300 QA pairs being an unbiased test of retrieval architecture. The abstract and evaluation description provide no information on how these pairs were generated, their provenance, or whether question construction traversed the CTI knowledge graph or used prompts that explicitly surface entity-relation chains. If the test cases were derived from the graph, vector RAG is structurally disadvantaged because it receives only flat text chunks while the hybrid system receives the relational scaffolding used to create the questions. This is load-bearing for attributing the delta to retrieval method rather than test-set construction; an independent, graph-agnostic source (real analyst logs or expert-authored queries) is required to support the comparative claim.

    Authors: We appreciate the referee's emphasis on this point, as it directly affects the interpretability of our results. The 3,300 QA pairs were created by domain experts in cybersecurity using publicly available CTI reports and standard analyst query patterns drawn from real-world sources. Question generation did not reference or traverse the knowledge graph, nor did it employ prompts designed to surface specific entity-relation chains from the graph; instead, questions were formulated to reflect typical CTI tasks (factual lookups, multi-hop reasoning across reports, synthesis, and unanswerable cases) based solely on the narrative content of the source documents. We acknowledge that the current manuscript provides insufficient detail on this process in the abstract and evaluation sections. In the revised version, we will add a dedicated subsection under Evaluation that describes the benchmark construction, including the expert authorship process, data sources, guidelines used to ensure diversity and realism, and an explicit statement confirming independence from the graph. This addition will allow readers to evaluate whether the test set is graph-agnostic and will strengthen the attribution of performance differences to the retrieval methods themselves. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations

full rationale

The paper is a comparative evaluation of four RAG architectures on a fixed set of 3,300 CTI QA pairs. No equations, parameters, or derivations are present that could reduce to inputs by construction. Performance deltas (e.g., 35% on multi-hop) are measured outcomes, not fitted or self-defined quantities. Self-citations, if any, are not load-bearing for the central empirical claims. The work is self-contained as an external benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical evaluation of existing retrieval paradigms applied to CTI; no new mathematical axioms, free parameters, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5562 in / 1266 out tokens · 109520 ms · 2026-05-10T15:13:27.202951+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 36 canonical work pages · 11 internal anchors

  1. [1] Renzo Angles et al. 2018. Foundations of Modern Graph Query Languages. Comput. Surveys 50, 5 (2018), 1–40. doi:10.1145/3104031
  2. [2] Anonymous. 2025. A Self-Correcting Agentic Graph RAG Framework for Clinical Decision Support. Frontiers in Medicine (2025). https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2025.1716327/full (Agentic RAG for clinical QA built on a hepatology knowledge graph; significantly outperforms baseline RAG and GraphRAG.)
  3. [3] Akari Asai et al. 2020. Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering. In International Conference on Learning Representations (ICLR).
  4. [4] Authors. 2024. KnowCTI: Knowledge-based cyber threat intelligence entity and relation extraction. Computers & Security (2024). https://www.sciencedirect.com/science/article/pii/S0167404824001251 (Example CTI ontology with entities and relations.)
  5. [5] Victor Barres et al. 2025. τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv:2506.07982 [cs.AI]. https://arxiv.org/abs/2506.07982
  6. [6] Kathrin Blagec et al. 2022. A global analysis of metrics used for measuring performance in natural language processing. In Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP. Association for Computational Linguistics, Dublin, Ireland, 52–63. doi:10.18653/v1/2022.nlppower-1.6
  7. [7] Yutong Cheng et al. 2025. CTINexus: Automatic Cyber Threat Intelligence Knowledge Graph Construction Using Large Language Models. In 2025 IEEE 10th European Symposium on Security and Privacy (EuroS&P), 923–938. doi:10.1109/EuroSP63326.2025.00057
  8. [8] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2025. From Local to Global: A GraphRAG Approach to Query-Focused Summarization. arXiv:2404.16130 [cs.CL]
  9. [9] FAIR CodeGen team et al. 2025. CWM: An Open-Weights LLM for Research on Code Generation with World Models. arXiv:2510.02387 [cs.SE]
  10. [10] Nadime Francis et al. 2018. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). Association for Computing Machinery. doi:10.1145/3183713.3190657
  11. [11] Jinlan Fu et al. 2023. GPTScore: Evaluate as You Desire. arXiv preprint arXiv:2302.04166 (2023).
  12. [12] Yunfan Gao et al. 2023. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997 (2023).
  13. [13] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 [cs.CL]. doi:10.48550/arXiv.2312.10997
  14. [14] Jiawei Gu et al. 2025. A Survey on LLM-as-a-Judge. arXiv:2411.15594 [cs.CL]. https://arxiv.org/abs/2411.15594
  15. [15] Dzenan Hamzic et al. 2025. Enhancing Cyber Situational Awareness with AI: A Novel Pipeline Approach for Threat Intelligence Analysis and Enrichment. In Availability, Reliability and Security. Springer Nature Switzerland, Cham, 44–62.
  16. [16] Yuqing Huang et al. 2025. SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment. In Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, Suzhou, China, 14175–14190. doi:10.18653/v1/2025.findings-emnlp.763
  17. [17] Ziwei Ji et al. 2022. Survey of Hallucination in Natural Language Generation. Comput. Surveys (2022). arXiv:2202.03629 [cs.CL]. doi:10.1145/3571730
  18. [18] Ziyan Jiang et al. 2024. LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs. arXiv:2406.15319 [cs.CL]. https://arxiv.org/abs/2406.15319
  19. [19] Jeff Johnson et al. 2017. Billion-Scale Similarity Search with GPUs. arXiv preprint arXiv:1702.08734 (2017). doi:10.48550/arXiv.1702.08734
  20. [20] Vladimir Karpukhin et al. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.550
  21. [21] Tom Kwiatkowski et al. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 7 (2019), 453–466.
  22. [22] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, Vol. 33. arXiv:2005.11401 [cs.CL]. doi:10.4...
  23. [23] Jiacheng Liang, Yuhui Wang, Changjiang Li, Rongyi Zhu, Tanqiu Jiang, Neil Gong, and Ting Wang. 2025. GraphRAG under Fire. arXiv:2501.14050 [cs.LG]
  24. [24] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013/
  25. [25] Yang Liu et al. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv preprint arXiv:2303.16634 (2023).
  26. [26] Anket Mehra et al. 2025. Improving Applicability of Deep Learning based Token Classification models during Training. arXiv:2504.01028 [cs.CV]. https://arxiv.org/abs/2504.01028
  27. [27] Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  28. [28] Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying Large Language Models and Knowledge Graphs: A Roadmap. arXiv:2306.08302 [cs.CL]
  29. [29] George Papageorgiou, Vangelis Sarlis, Manolis Maragoudakis, and Christos Tjortjis. 2025. Hybrid Multi-Agent GraphRAG for E-Government: Towards a Trustworthy AI Assistant. Applied Sciences 15, 11 (2025), 6315. doi:10.3390/app15116315
  30. [30] Boci Peng et al. 2025. Graph Retrieval-Augmented Generation: A Survey. ACM Transactions on Information Systems 44, 2 (2025), 1–52. doi:10.1145/3777378
  31. [31] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics. doi:10.18653/v1/P18-2124
  32. [32] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2383–2392.
  33. [33] Injy Sarhan and Marco Spruit. 2021. Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph. Knowledge-Based Systems (2021). doi:10.1016/j.knosys.2021.107524
  34. [34] Noah Shinn et al. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems, Vol. 36. https://arxiv.org/abs/2303.11366
  35. [35] Haitian Sun et al. 2019. PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, C...
  36. [36] C. J. van Rijsbergen. 1979. Information Retrieval. Butterworth-Heinemann.
  37. [37] Zhitong Wang et al. 2025. Document Segmentation Matters for Retrieval-Augmented Generation. In Findings of the Association for Computational Linguistics: ACL 2025. Association for Computational Linguistics, Vienna, Austria, 8063–8075. doi:10.18653/v1/2025.findings-acl.422
  38. [38] Junde Wu et al. 2025. Medical Graph RAG: Evidence-based Medical Large Language Model via Graph Retrieval-Augmented Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Long Papers). Association for Computational Linguistics, Vienna, Austria, 28443–28467. doi:10.18653/v1/2025.acl-long.1381
  39. [40] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. doi:10.18653/v1/D18-1259
  40. [41] Shunyu Yao et al. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL]. doi:10.48550/arXiv.2210.03629
  41. [42] Qinggang Zhang et al. 2025. When to Use Graphs in Retrieval-Augmented Generation: A Comprehensive Analysis for Graph Retrieval-Augmented Generation. arXiv preprint arXiv:2506.05690 (2025). https://arxiv.org/abs/2506.05690
  42. [43] Tianyi Zhang et al. 2020. BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675 [cs.CL]. https://arxiv.org/abs/1904.09675
  43. [44] Lianmin Zheng et al. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685 (2023).
  44. [45] Xiangrong Zhu et al. 2025. Knowledge Graph-Guided Retrieval Augmented Generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2025). Association for Computational Linguistics, Albuquerque, New Mexico, USA, 8912–8924. doi:10....