Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
Pith reviewed 2026-05-10 15:13 UTC · model grok-4.3
The pith
Hybrid graph-text retrieval improves answer quality by up to 35 percent on multi-hop CTI questions compared to vector RAG.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a systematic evaluation of four RAG architectures for CTI analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic variant that repairs failed graph queries, and a hybrid approach combining graph queries with text retrieval. We evaluate these systems on 3,300 CTI question-answer pairs spanning factual lookups, multi-hop relational queries, analyst-style synthesis questions, and unanswerable cases. Results show that graph grounding improves performance on structured factual queries. The hybrid graph-text approach improves answer quality by up to 35 percent on multi-hop questions compared to vector RAG, while maintaining more reliable performance than graph-only systems.
What carries the argument
The hybrid graph-text retrieval architecture, which runs structured queries over an entity-relationship knowledge graph of threats and falls back to or augments with vector text retrieval from the original reports.
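The paper's implementation is not published; a minimal sketch of the hybrid pattern, where `GRAPH` and `DOCS` are toy stand-ins for the CTI knowledge graph and report corpus, and a bag-of-words cosine stands in for dense vector retrieval:

```python
# Sketch of hybrid graph-text retrieval (illustrative only).
from collections import Counter
from math import sqrt

GRAPH = {  # (subject, relation) -> objects; toy CTI knowledge graph
    ("APT-X", "uses"): {"LockMal"},
    ("LockMal", "exploits"): {"CVE-2024-0001"},
}
DOCS = {  # toy report corpus for the text side
    "r1": "APT-X deployed LockMal ransomware against energy firms",
    "r2": "LockMal exploits CVE-2024-0001 in unpatched VPN appliances",
}

def graph_retrieve(subject, relation):
    """Structured lookup over the entity-relationship graph."""
    return GRAPH.get((subject, relation), set())

def vector_retrieve(query, k=1):
    """Rank documents by cosine similarity over word counts."""
    q = Counter(query.lower().split())
    def cos(text):
        c = Counter(text.lower().split())
        dot = sum(q[w] * c[w] for w in q)
        norm = (sqrt(sum(v * v for v in q.values()))
                * sqrt(sum(v * v for v in c.values())))
        return dot / norm if norm else 0.0
    return sorted(DOCS, key=lambda i: cos(DOCS[i]), reverse=True)[:k]

def hybrid_retrieve(subject, relation, query):
    """Graph hits supply entities; text retrieval supplies context."""
    return {
        "entities": graph_retrieve(subject, relation),
        "passages": [DOCS[i] for i in vector_retrieve(query)],
    }

ctx = hybrid_retrieve("LockMal", "exploits",
                      "LockMal exploits which vulnerability")
```

When the graph lookup comes back empty (a missing entity or relation), the retrieved passages still give the generator something to ground on, which is the reliability advantage the review attributes to the hybrid.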
If this is right
- Graph retrieval alone suffices for simple factual lookups over threat entities.
- Hybrid systems reduce the error spikes seen in pure graph retrieval on ambiguous or incomplete queries.
- Multi-hop relational questions benefit most from explicit relationship traversal plus text context.
- Agentic query repair improves graph-only results but does not match the hybrid's consistency.
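The agentic repair idea in the last bullet amounts to a bounded retry loop; the alias-table repair step below is a hypothetical stand-in, since the paper's actual repair policy is not described here:

```python
# Bounded retry loop for agentic graph-query repair (sketch).
GRAPH = {("APT-X", "uses"): {"LockMal"}}
ALIASES = {"aptx": "APT-X", "apt x": "APT-X"}  # hypothetical alias table

def run_query(subject, relation):
    """Structured lookup; empty set models a failed graph query."""
    return GRAPH.get((subject, relation), set())

def repair(subject):
    """One repair step: normalize the entity name via the alias table."""
    return ALIASES.get(subject.lower().replace("-", ""), subject)

def agentic_query(subject, relation, max_attempts=2):
    for _ in range(max_attempts):
        result = run_query(subject, relation)
        if result:
            return result
        subject = repair(subject)  # agent proposes a fix, then retries
    return set()  # exhausted: caller may fall back to text retrieval

hits = agentic_query("aptx", "uses")  # misspelled entity gets repaired
```

The loop fixes recoverable failures (entity-name mismatches) but returns empty when the graph genuinely lacks the fact, which is consistent with the bullet's claim that repair alone does not match the hybrid's consistency.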
Where Pith is reading between the lines
- Analyst tools could route queries by detected type, using graphs only when relationships dominate.
- The same hybrid pattern may apply to other domains where evidence is scattered across narrative reports.
- Larger or more densely connected threat graphs could widen the measured advantage on multi-hop tasks.
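The query-routing idea in the first bullet above could look like this in miniature; the keyword cues are illustrative only, and a production system would use a trained query classifier:

```python
# Crude query router: relational queries go to the graph retriever,
# everything else to vector retrieval. Cue list is hypothetical.
RELATIONAL_CUES = ("which actor", "who uses", "linked to", "exploits",
                   "associated with", "targets")

def route(query):
    q = query.lower()
    return "graph" if any(cue in q for cue in RELATIONAL_CUES) else "vector"

graph_route = route("Which actor is linked to LockMal?")
vector_route = route("Summarize the 2024 ransomware landscape")
```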
Load-bearing premise
The 3,300 CTI question-answer pairs are representative of real analyst needs and the chosen metrics accurately measure answer quality without bias introduced by how the questions or knowledge graph were built.
What would settle it
A fresh collection of CTI questions written by practicing analysts, scored by the same quality metrics, shows no gain or a loss for the hybrid system over vector RAG on the multi-hop subset.
read the original abstract
Cyber threat intelligence (CTI) analysts must answer complex questions over large collections of narrative security reports. Retrieval-augmented generation (RAG) systems help language models access external knowledge, but traditional vector retrieval often struggles with queries that require reasoning over relationships between entities such as threat actors, malware, and vulnerabilities. This limitation arises because relevant evidence is often distributed across multiple text fragments and documents. Knowledge graphs address this challenge by enabling structured multi-hop reasoning through explicit representations of entities and relationships. However, multiple retrieval paradigms, including graph-based, agentic, and hybrid approaches, have emerged with different assumptions and failure modes. It remains unclear how these approaches compare in realistic CTI settings and when graph grounding improves performance. We present a systematic evaluation of four RAG architectures for CTI analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic variant that repairs failed graph queries, and a hybrid approach combining graph queries with text retrieval. We evaluate these systems on 3,300 CTI question-answer pairs spanning factual lookups, multi-hop relational queries, analyst-style synthesis questions, and unanswerable cases. Results show that graph grounding improves performance on structured factual queries. The hybrid graph-text approach improves answer quality by up to 35 percent on multi-hop questions compared to vector RAG, while maintaining more reliable performance than graph-only systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a systematic empirical evaluation of four RAG architectures for cyber threat intelligence (CTI) analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic graph-query repair variant, and a hybrid graph-text approach. It evaluates these on 3,300 CTI question-answer pairs covering factual lookups, multi-hop relational queries, analyst-style synthesis, and unanswerable cases, claiming that graph grounding helps structured factual queries and that the hybrid approach improves answer quality by up to 35% on multi-hop questions relative to vector RAG while being more reliable than graph-only systems.
Significance. If the evaluation setup is free of construction bias, the work supplies a useful head-to-head comparison of retrieval paradigms in a specialized, high-stakes domain. It offers concrete evidence on when graph augmentation helps multi-hop reasoning over narrative reports and identifies failure modes of pure graph and pure vector approaches, which could inform practical RAG design for relational domains.
major comments (1)
- [Abstract and Evaluation Setup] The central 35% improvement claim on multi-hop questions (abstract) rests on the 3,300 QA pairs being an unbiased test of retrieval architecture. The abstract and evaluation description provide no information on how these pairs were generated, their provenance, or whether question construction traversed the CTI knowledge graph or used prompts that explicitly surface entity-relation chains. If the test cases were derived from the graph, vector RAG is structurally disadvantaged because it receives only flat text chunks while the hybrid system receives the relational scaffolding used to create the questions. This is load-bearing for attributing the delta to retrieval method rather than test-set construction; an independent, graph-agnostic source (real analyst logs or expert-authored queries) is required to support the comparative claim.
minor comments (2)
- [Abstract] The abstract states quantitative results but supplies no information on the exact metrics used to measure 'answer quality,' whether statistical significance testing was performed, the question-generation process, or controls for dataset bias.
- [Methods] Details on how the CTI knowledge graph was constructed, its coverage, and how graph queries are formulated and repaired in the agentic variant are needed to allow replication and to assess generalizability beyond the specific CTI corpus.
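The significance-testing gap flagged in the first minor comment could be closed with a paired bootstrap over per-question scores; a sketch on synthetic scores, since the paper's per-item results are not available:

```python
# Paired bootstrap for the hybrid-vs-vector delta. Scores are
# synthetic (gaussian around hypothetical means); the paper's real
# per-question scores would be substituted here.
import random

random.seed(0)
N = 300  # hypothetical size of the multi-hop subset
hybrid = [random.gauss(0.70, 0.15) for _ in range(N)]
vector = [random.gauss(0.52, 0.15) for _ in range(N)]

def paired_bootstrap_p(a, b, iters=2000, seed=1):
    """One-sided p-value for mean(a) > mean(b) via paired resampling."""
    rng = random.Random(seed)
    n = len(a)
    losses = 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample questions
        if sum(a[i] for i in idx) <= sum(b[i] for i in idx):
            losses += 1
    return losses / iters

p = paired_bootstrap_p(hybrid, vector)
```

Resampling questions (not systems) keeps the pairing intact, so the test asks how often the observed advantage would vanish under a different draw of test items.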
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that clear documentation of the benchmark construction is essential to support the comparative claims, and we will revise the manuscript accordingly to provide this transparency.
read point-by-point responses
Referee: [Abstract and Evaluation Setup] The central 35% improvement claim on multi-hop questions (abstract) rests on the 3,300 QA pairs being an unbiased test of retrieval architecture. The abstract and evaluation description provide no information on how these pairs were generated, their provenance, or whether question construction traversed the CTI knowledge graph or used prompts that explicitly surface entity-relation chains. If the test cases were derived from the graph, vector RAG is structurally disadvantaged because it receives only flat text chunks while the hybrid system receives the relational scaffolding used to create the questions. This is load-bearing for attributing the delta to retrieval method rather than test-set construction; an independent, graph-agnostic source (real analyst logs or expert-authored queries) is required to support the comparative claim.
Authors: We appreciate the referee's emphasis on this point, as it directly affects the interpretability of our results. The 3,300 QA pairs were created by domain experts in cybersecurity using publicly available CTI reports and standard analyst query patterns drawn from real-world sources. Question generation did not reference or traverse the knowledge graph, nor did it employ prompts designed to surface specific entity-relation chains from the graph; instead, questions were formulated to reflect typical CTI tasks (factual lookups, multi-hop reasoning across reports, synthesis, and unanswerable cases) based solely on the narrative content of the source documents. We acknowledge that the current manuscript provides insufficient detail on this process in the abstract and evaluation sections. In the revised version, we will add a dedicated subsection under Evaluation that describes the benchmark construction, including the expert authorship process, data sources, guidelines used to ensure diversity and realism, and an explicit statement confirming independence from the graph. This addition will allow readers to evaluate whether the test set is graph-agnostic and will strengthen the attribution of performance differences to the retrieval methods themselves.
revision: yes
Circularity Check
No circularity: purely empirical benchmark with no derivations
full rationale
The paper is a comparative evaluation of four RAG architectures on a fixed set of 3,300 CTI QA pairs. No equations, parameters, or derivations are present that could reduce to inputs by construction. Performance deltas (e.g., 35% on multi-hop) are measured outcomes, not fitted or self-defined quantities. Self-citations, if any, are not load-bearing for the central empirical claims. The work is self-contained as an external benchmark study.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Renzo Angles et al. 2018. Foundations of Modern Graph Query Languages. Comput. Surveys 50, 5 (2018), 1–40. doi:10.1145/3104031
- [2] Anonymous. 2025. A Self-Correcting Agentic Graph RAG Framework for Clinical Decision Support. Frontiers in Medicine (2025). https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2025.1716327/full — Agentic RAG for clinical QA built on a hepatology knowledge graph; significantly outperforms baseline RAG and GraphRAG
- [3] Akari Asai et al. 2020. Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering. In International Conference on Learning Representations (ICLR 2020)
- [4] Authors. 2024. KnowCTI: Knowledge-based cyber threat intelligence entity and relation extraction. Computers & Security (2024). https://www.sciencedirect.com/science/article/pii/S0167404824001251 — Example CTI ontology with entities and relations
- [5] Victor Barres et al. 2025. τ2-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv:2506.07982 [cs.AI]. https://arxiv.org/abs/2506.07982
- [6] Kathrin Blagec et al. 2022. A global analysis of metrics used for measuring performance in natural language processing. In Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP. Association for Computational Linguistics, Dublin, Ireland, 52–63. doi:10.18653/v1/2022.nlppower-1.6
- [7] Yutong Cheng et al. 2025. CTINexus: Automatic Cyber Threat Intelligence Knowledge Graph Construction Using Large Language Models. In 2025 IEEE 10th European Symposium on Security and Privacy (EuroS&P), 923–938. doi:10.1109/EuroSP63326.2025.00057
- [8] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2025. From Local to Global: A GraphRAG Approach to Query-Focused Summarization. arXiv:2404.16130 [cs.CL]
- [9] FAIR CodeGen team et al. 2025. CWM: An Open-Weights LLM for Research on Code Generation with World Models. arXiv:2510.02387 [cs.SE]
- [10] Nadime Francis et al. 2018. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery. doi:10.1145/3183713.3190657
- [11]
- [12] Yunfan Gao et al. 2023. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997 (2023)
- [13] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 [cs.CL]. doi:10.48550/arXiv.2312.10997
- [14] Jiawei Gu et al. 2025. A Survey on LLM-as-a-Judge. arXiv:2411.15594 [cs.CL]. https://arxiv.org/abs/2411.15594
- [15] Dzenan Hamzic et al. 2025. Enhancing Cyber Situational Awareness with AI: A Novel Pipeline Approach for Threat Intelligence Analysis and Enrichment. In Availability, Reliability and Security. Springer Nature Switzerland, Cham, 44–62
- [16] Yuqing Huang et al. 2025. SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment. In Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, Suzhou, China, 14175–14190. doi:10.18653/v1/2025.findings-emnlp.763
- [17] Ziwei Ji et al. 2022. Survey of Hallucination in Natural Language Generation. Comput. Surveys (2022). arXiv:2202.03629 [cs.CL]. doi:10.1145/3571730
- [18]
- [19] Jeff Johnson et al. 2017. Billion-Scale Similarity Search with GPUs. arXiv preprint arXiv:1702.08734 (2017). doi:10.48550/arXiv.1702.08734
- [20] Vladimir Karpukhin et al. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.550
- [21] Tom Kwiatkowski et al. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 7 (2019), 453–466
- [22] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, Vol. 33. arXiv:2005.11401 [cs.CL]. doi:10.48550/arXiv.2005.11401
- [23]
- [24] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013/
- [25] Yang Liu et al. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv preprint arXiv:2303.16634 (2023)
- [26]
- [27] Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)
- [28]
- [29] George Papageorgiou, Vangelis Sarlis, Manolis Maragoudakis, and Christos Tjortjis. 2025. Hybrid Multi-Agent GraphRAG for E-Government: Towards a Trustworthy AI Assistant. Applied Sciences 15, 11 (2025), 6315. doi:10.3390/app15116315
- [30] Boci Peng et al. 2025. Graph Retrieval-Augmented Generation: A Survey. ACM Transactions on Information Systems 44, 2 (2025), 1–52. doi:10.1145/3777378
- [31] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. doi:10.18653/v1/P18-2124
- [32] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2383–2392
- [33] Injy Sarhan and Marco Spruit. 2021. Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph. Knowledge-Based Systems (2021). doi:10.1016/j.knosys.2021.107524
- [34] Noah Shinn et al. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems, Vol. 36. https://arxiv.org/abs/2303.11366
- [35] Haitian Sun et al. 2019. PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China
- [36] C. J. van Rijsbergen. 1979. Information Retrieval. Butterworth-Heinemann
- [37] Zhitong Wang et al. 2025. Document Segmentation Matters for Retrieval-Augmented Generation. In Findings of the Association for Computational Linguistics: ACL 2025. Association for Computational Linguistics, Vienna, Austria, 8063–8075. doi:10.18653/v1/2025.findings-acl.422
- [38] Junde Wu et al. 2025. Medical Graph RAG: Evidence-based Medical Large Language Model via Graph Retrieval-Augmented Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Long Papers). Association for Computational Linguistics, Vienna, Austria, 28443–28467. doi:10.18653/v1/2025.acl-long.1381
- [40] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. doi:10.18653/v1/D18-1259
- [41] Shunyu Yao et al. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL]. doi:10.48550/arXiv.2210.03629
- [42]
- [43] Tianyi Zhang et al. 2020. BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675 [cs.CL]. https://arxiv.org/abs/1904.09675
- [44] Lianmin Zheng et al. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685 (2023)
- [45] Xiangrong Zhu et al. 2025. Knowledge Graph-Guided Retrieval Augmented Generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2025). Association for Computational Linguistics, Albuquerque, New Mexico, USA, 8912–8924