DEFENGRAPH: Knowledge Graph-Enhanced LLMs for Blue Team Cyber Defense

Ahmad Mohsin; Ahmed Ibrahim; Diksha Goel; Gang Li; Guangsheng Yu; Helge Janicke; Kristen Moore; Minjune Kim; Qin Wang; Zhen Wang

arxiv: 2606.21059 · v1 · pith:S2NAK5KKnew · submitted 2026-06-19 · 💻 cs.CR

Pith reviewed 2026-06-26 14:09 UTC · model grok-4.3

classification 💻 cs.CR

keywords knowledge graph · large language models · cybersecurity · blue team defense · SIEM alerts · cyber range exercises · decision support · graph retrieval

The pith

A dual-layer static-dynamic knowledge graph grounds LLMs to raise accuracy in cyber defense reasoning and actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DEFENGRAPH, an LLM assistant for blue team cyber defense that builds a dual-layer static-dynamic knowledge graph from SIEM alerts, system topology, attacker behaviors, and prior defensive actions. Graph-based path retrieval, LLM-driven contextual filtering, and reasoning-based re-ranking then connect long-term domain knowledge with evolving event context. Evaluations on data from live Red versus Blue cyber range exercises show higher reasoning-recall and ticket-action recall across GPT-4o, LLaMA-3, DeepSeek-R1, and QWen-3 while fault rates stay steady and more correct defense actions surface. The approach targets hallucinations and shallow temporal reasoning that limit plain LLMs in high-stakes, time-evolving settings.

Core claim

DEFENGRAPH integrates a dual-layer Static-Dynamic Knowledge Graph with graph-based path retrieval, LLM-driven contextual filtering, and reasoning-based re-ranking to ground LLM outputs in both long-term domain knowledge and evolving event context from heterogeneous security artifacts, enabling faithful and temporally aware decision support as measured by improved recall metrics on realistic noisy datasets from cyber range exercises.

What carries the argument

Dual-layer Static-Dynamic Knowledge Graph together with graph-based path retrieval, LLM-driven contextual filtering, and reasoning-based re-ranking.
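A minimal sketch of how such a pipeline could be wired together, offered as an illustration rather than the authors' implementation. Everything named here is an assumption: the node and relation names, the `EDGES` list, and the functions `retrieve_paths`, `contextual_filter`, and `rerank`. In particular, the two LLM-driven stages are replaced by simple rule-based stand-ins so the example runs without a model.

```python
from collections import defaultdict

# Each edge is (head, relation, tail, layer, timestamp).
# Static edges carry no timestamp; dynamic edges carry the alert time.
# All node and relation names are hypothetical examples.
EDGES = [
    ("web01", "runs", "nginx", "static", None),
    ("nginx", "mitigated_by", "patch_nginx", "static", None),
    ("web01", "protected_by", "fw01", "static", None),
    ("fw01", "mitigated_by", "block_source_ip", "static", None),
    ("web01", "mitigated_by", "review_web01_logs", "static", None),
    ("alert_17", "targets", "web01", "dynamic", 100),
    ("alert_03", "targets", "web01", "dynamic", 20),
]


def build_graph(edges):
    """Index edges by head node for traversal."""
    graph = defaultdict(list)
    for head, rel, tail, layer, ts in edges:
        graph[head].append((rel, tail, layer, ts))
    return graph


def retrieve_paths(graph, start, max_hops=3):
    """Enumerate simple paths of up to max_hops edges starting at `start`."""
    paths = []
    stack = [(start, [])]
    while stack:
        node, path = stack.pop()
        if path:
            paths.append(path)
        if len(path) == max_hops:
            continue
        visited = {start} | {step[2] for step in path}
        for rel, tail, layer, ts in graph.get(node, []):
            if tail not in visited:
                stack.append((tail, path + [(node, rel, tail, layer, ts)]))
    return paths


def contextual_filter(paths, now, window):
    """Stand-in for the LLM filter: keep paths that end in a mitigation
    and whose dynamic edges fall inside the recency window."""
    kept = []
    for path in paths:
        ends_in_action = path[-1][1] == "mitigated_by"
        fresh = all(step[4] is None or now - step[4] <= window for step in path)
        if ends_in_action and fresh:
            kept.append(path)
    return kept


def rerank(paths):
    """Stand-in for reasoning-based re-ranking: shorter paths first,
    ties broken alphabetically by the proposed action."""
    return sorted(paths, key=lambda p: (len(p), p[-1][2]))


def candidate_actions(edges, alert, now, window=60, max_hops=3):
    """Run retrieval, filtering, and re-ranking for one alert node."""
    graph = build_graph(edges)
    paths = retrieve_paths(graph, alert, max_hops)
    ranked = rerank(contextual_filter(paths, now, window))
    return [path[-1][2] for path in ranked]


print(candidate_actions(EDGES, "alert_17", now=110))
print(candidate_actions(EDGES, "alert_03", now=110))
```

In this toy setup the recent alert reaches three mitigation nodes through the static layer, while the stale alert yields no candidates because its dynamic edge falls outside the window. A real system would replace the two stand-ins with model calls and would need a far richer schema.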

If this is right

Reasoning-recall rises from 61.45% to 73.49% on GPT-4o.
Ticket-action recall rises from 52.17% to 72.46% on GPT-4o with precision moving from 24.49% to 29.24%.
Up to 50 correct defense actions surface versus 36 for the next best baseline.
Comparable recall gains appear on LLaMA-3, DeepSeek-R1, and QWen-3.
Fault rates remain steady across the tested models.
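The abstract does not define its metrics, so the following is only a sketch under an assumed, standard set-based definition: recall is the share of ground-truth actions that the system suggests, and precision is the share of suggested actions that appear in the ground truth. The action names and the helper `precision_recall` are hypothetical; the percentages in `reported` are the GPT-4o figures quoted above.

```python
def precision_recall(predicted, ground_truth):
    """Set-based precision and recall (assumed definition, not the paper's)."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    correct = len(predicted & ground_truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical toy example: 3 of 4 true actions found among 6 suggestions.
truth = {"block_source_ip", "patch_nginx", "reset_password", "isolate_host"}
suggested = {
    "block_source_ip", "patch_nginx", "isolate_host",
    "reboot_router", "disable_dns", "wipe_disk",
}
p, r = precision_recall(suggested, truth)
print(f"precision={p:.2f} recall={r:.2f}")

# Reported GPT-4o figures in percent: (baseline, DEFENGRAPH).
reported = {
    "reasoning-recall": (61.45, 73.49),
    "ticket-action recall": (52.17, 72.46),
    "ticket-action precision": (24.49, 29.24),
}
gains = {
    name: round(after - before, 2)
    for name, (before, after) in reported.items()
}
for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
```

Read this way, the recall gains are several times larger than the precision gain. If the assumed definition matches the paper's, a precision near 29% would mean that most suggested actions still do not match a ticket action, which is worth keeping in mind alongside the headline recall numbers.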

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same grounding pattern could apply to other high-stakes domains that require tracking evolving states, such as network operations or industrial control.
Structured external memory may allow smaller or less specialized models to reach performance levels otherwise needing larger ones.
Maintaining an accurate dynamic layer in live environments would require automated update mechanisms not tested in the exercises.

Load-bearing premise

Knowledge graphs built from SIEM alerts, system topology, attacker behaviors, and prior defensive actions faithfully represent both long-term domain knowledge and evolving event context in a manner that directly enables the observed improvements.

What would settle it

Replacing the constructed knowledge graphs with random connections or incomplete security data and checking whether the recall gains on reasoning and actions disappear would settle whether the graph integration drives the results.
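One way such a control could be prepared is sketched below, assuming the graph is available as a list of `(head, relation, tail)` triples. The function names `rewire_tails` and `drop_edges` and the sample triples are hypothetical; the paper does not describe this experiment.

```python
import random


def rewire_tails(edges, seed=0):
    """Shuffle tail nodes across edges.

    Heads, relations, and the multiset of tails are preserved, so the
    control graph keeps its size and vocabulary but loses its structure.
    """
    rng = random.Random(seed)
    tails = [tail for _, _, tail in edges]
    rng.shuffle(tails)
    return [(head, rel, tail) for (head, rel, _), tail in zip(edges, tails)]


def drop_edges(edges, keep_fraction, seed=0):
    """Keep a random subset of edges to simulate incomplete security data."""
    rng = random.Random(seed)
    keep = round(len(edges) * keep_fraction)
    return rng.sample(edges, keep)


# Hypothetical triples for illustration only.
edges = [
    ("alert_17", "targets", "web01"),
    ("web01", "runs", "nginx"),
    ("nginx", "mitigated_by", "patch_nginx"),
    ("web01", "protected_by", "fw01"),
    ("fw01", "mitigated_by", "block_source_ip"),
]
random_control = rewire_tails(edges, seed=1)
sparse_control = drop_edges(edges, keep_fraction=0.6, seed=1)
```

Running the same evaluation on the original graph and on both controls would indicate whether the gains depend on the graph's actual connections or only on the presence of extra context.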

Figures

Figures reproduced from arXiv: 2606.21059 by Ahmad Mohsin, Ahmed Ibrahim, Diksha Goel, Gang Li, Guangsheng Yu, Helge Janicke, Kristen Moore, Minjune Kim, Qin Wang, Zhen Wang.

**Figure 2.** Complex and noisy context in cybersecurity defense.

**Figure 3.** System overview of DEFENGRAPH. Historical alerts and tickets are structured into a static KG, while new Wazuh alerts provide dynamic knowledge. The system KG encodes the host and service topology. A static–dynamic KG is built by fusing historical and real-time data. Given a new alert, the system extracts subgraphs, ranks reasoning paths with an LLM, and generates context-aware defense actions aligned with …

**Figure 4.** Red Team Attack Tracing. Left shows the attack trace from the red team. Red Nodes represent attacking actions performed by red team members. For example, attack 8 (red09 attack 8) is carried out by red team member 09. Blue Nodes and Purple Nodes indicate servers or firewalls. Right shows the attacking shell scripts used in red team attack 8, along with the corresponding defense ticket from the blue…

**Figure 5.** The ACDC Knowledge Graph example demonstrates the nodes and …

**Figure 6.** Wazuh alert graph query on the entity node RSLGB

**Figure 7.** Examples of incorrect aggressive actions suggested by LLMs. Top: …

**Figure 8.** Output comparison of LLM-based Contextual Filter between the DG (SKG) and the DG-Full.

**Figure 9.** Example generated action generated by Full D…

**Figure 10.** Example generated response by DG (SKG) based on static KG with …

**Figure 11.** Demonstration of Top-K generated relation paths (§III-C) and …

Original abstract

Large Language Models (LLMs) show promise for supporting decision-making in cybersecurity, but their reliability in high-stakes, time-evolving environments remains limited due to hallucinations, poor temporal reasoning, and shallow grounding in system context. We introduce DEFENGRAPH, an LLM-driven assistant designed to support human defenders during cybersecurity incidents. DEFENGRAPH improves contextual reasoning by integrating a dual-layer Static-Dynamic Knowledge Graph (KG) with graph-based path retrieval, LLM-driven contextual filtering, and reasoning-based re-ranking. The framework grounds LLM outputs in both long-term domain knowledge and evolving event context, enabling faithful and temporally aware decision support. We evaluate DEFENGRAPH in a cyber defense setting using knowledge graphs constructed from heterogeneous security artifacts, including SIEM alerts, system topology, attacker behaviors, and prior defensive actions. The evaluation uses data collected during live Red vs. Blue team cyber range exercises simulating attacks on critical infrastructure, which generate realistic and noisy datasets reflecting real-world defender workflows and system dynamics. Evaluations across four prevalent LLMs show that DEFENGRAPH sets a new state-of-the-art: on GPT-4o it boosts reasoning-recall from 61.45% to 73.49% and ticket-action recall from 52.17% to 72.46% (precision 24.49% to 29.24%), with similar gains on LLaMA-3 (46.99% to 61.45%), DeepSeek-R1 (45.78% to 56.63%) and QWen-3 (51.81% to 59.04%), while surfacing up to 50 correct defense actions versus 36 for the next best baseline and holding fault rates steady.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DEFENGRAPH shows recall gains on cyber range data with a dual-layer KG pipeline, but the abstract gives no evidence that the graph structure itself produces the lift rather than added context.

The main things to know are that this paper reports concrete recall improvements on GPT-4o and three other LLMs when a static-dynamic knowledge graph is added to support defense decisions, and that the evaluation uses data from actual red-blue cyber range exercises on critical infrastructure. The architecture combines graph path retrieval, LLM filtering, and re-ranking to ground outputs in both long-term knowledge and live events.

What is new is the explicit dual-layer KG design tailored to SIEM alerts, topology, attacker behaviors, and prior actions, then applied to noisy live exercise logs. The paper does a reasonable job testing the same pipeline across multiple models and showing consistent direction of gains, with more correct actions surfaced while fault rates stay flat. Using real exercise data instead of clean benchmarks is a practical strength for this domain.

The soft spots sit in the evaluation. The abstract states the percentage lifts but supplies no definitions for reasoning-recall or ticket-action recall, no details on baseline implementations, no statistical tests, and no ablations that isolate the graph mechanisms from simply feeding the LLM more structured text. Without those, the claim that the dual-layer KG is responsible for the gains remains unverified. Graph fidelity to the domain is assumed rather than demonstrated.

This work is aimed at researchers building grounded LLM assistants for security operations. A reader who needs applied examples of KG augmentation in high-stakes, time-sensitive settings will find usable ideas here. It deserves peer review because the empirical setting is relevant and the architecture is described at a level that can be examined, even though the current methods section will require substantial expansion and controls.

Referee Report

3 major / 2 minor

Summary. The paper introduces DEFENGRAPH, an LLM-driven assistant for blue-team cyber defense that augments models via a dual-layer Static-Dynamic Knowledge Graph (KG) built from SIEM alerts, system topology, attacker behaviors, and prior actions. Graph path retrieval, LLM contextual filtering, and reasoning-based re-ranking are used to ground outputs in long-term knowledge and evolving context. On data from live Red-vs-Blue cyber-range exercises, the system is reported to raise reasoning-recall from 61.45% to 73.49% and ticket-action recall from 52.17% to 72.46% (precision 24.49% to 29.24%) on GPT-4o, with analogous gains on LLaMA-3, DeepSeek-R1 and QWen-3, while surfacing up to 50 correct actions versus 36 for the next baseline and keeping fault rates steady.

Significance. If the empirical claims are substantiated with full methodological detail, the work would offer a concrete demonstration that dual-layer KGs can measurably improve LLM reliability for time-sensitive defensive decision support on realistic, noisy security data. The multi-model evaluation and use of live exercise traces are strengths that increase external validity relative to purely synthetic benchmarks.

major comments (3)

[Abstract and Evaluation section] The concrete lifts (reasoning-recall 61.45%→73.49%, ticket-action recall 52.17%→72.46% on GPT-4o) are stated without definitions of the recall/precision metrics, descriptions of baseline implementations, statistical significance tests, or data-exclusion rules applied to the cyber-range dataset. These omissions are load-bearing for any claim that the gains are attributable to the proposed mechanisms rather than confounds or metric choices.
[Methodology / KG construction section] The central attribution of performance gains to the Static-Dynamic KG plus path retrieval, filtering, and re-ranking rests on the unverified assumption that the constructed graphs faithfully encode both long-term domain knowledge and evolving event context. No coverage metrics, expert validation of completeness, or ablation isolating graph structure from raw-text augmentation are supplied, leaving open the possibility that improvements arise simply from additional structured context.
[Results table, action-surfacing numbers] The claim of surfacing up to 50 correct defense actions versus 36 for the next baseline is presented without variance estimates, per-exercise breakdowns, or confirmation that the same set of ground-truth actions was used for all systems, undermining the cross-baseline comparison.
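The variance estimates requested in the third comment could take several forms. One common option is a percentile bootstrap over evaluation items, sketched below. The per-item hit flags are hypothetical, the function name `bootstrap_recall_ci` is an assumption, and the paper does not say which uncertainty estimate, if any, it uses.

```python
import random


def bootstrap_recall_ci(hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for recall.

    hits: list of 0/1 flags, one per ground-truth item, where 1 means the
    system recovered that item.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(hits)
    estimates = sorted(
        sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(hits) / n, lower, upper


# Hypothetical data: 28 of 40 ground-truth items recovered.
hits = [1] * 28 + [0] * 12
point, lower, upper = bootstrap_recall_ci(hits)
print(f"recall={point:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

With evaluation sets of this size the interval is wide, which is why a single pair of counts is hard to interpret without it. Resampling whole exercises instead of individual items would be the more conservative choice if items within an exercise are correlated.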

minor comments (2)

[Figure 1 / §3] The notation distinguishing static versus dynamic layers in the KG could be made more explicit, ideally with an accompanying diagram that labels edge types and temporal scopes.
[Related Work] A small number of recent KG-augmented LLM papers in the security domain are not referenced in the related-work section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commitments to revisions that strengthen the manuscript's transparency without altering its core claims.

Referee: [Abstract and Evaluation section] The concrete lifts (reasoning-recall 61.45%→73.49%, ticket-action recall 52.17%→72.46% on GPT-4o) are stated without definitions of the recall/precision metrics, descriptions of baseline implementations, statistical significance tests, or data-exclusion rules applied to the cyber-range dataset. These omissions are load-bearing for any claim that the gains are attributable to the proposed mechanisms rather than confounds or metric choices.

Authors: We agree these details are essential for rigorous interpretation. The revised manuscript will add explicit definitions of reasoning-recall and ticket-action recall (including how ground-truth positives are identified from the annotated cyber-range traces), full descriptions of baseline implementations, results from statistical significance tests such as McNemar's test, and a statement of any data-exclusion rules. These additions will be placed in a new subsection of the Evaluation section.

Revision: yes

Referee: [Methodology / KG construction section] The central attribution of performance gains to the Static-Dynamic KG plus path retrieval, filtering, and re-ranking rests on the unverified assumption that the constructed graphs faithfully encode both long-term domain knowledge and evolving event context. No coverage metrics, expert validation of completeness, or ablation isolating graph structure from raw-text augmentation are supplied, leaving open the possibility that improvements arise simply from additional structured context.

Authors: The manuscript details KG construction from SIEM alerts, topology, attacker behaviors, and prior actions, with consistent gains across four LLMs supporting the value of the structured dual-layer approach. We will add quantitative coverage metrics for both Static and Dynamic layers. Formal expert validation of completeness was not performed in the original study; we will note this limitation explicitly. A full ablation isolating graph structure from raw-text context was not conducted; we will either add a targeted ablation where feasible or discuss it as future work while maintaining that path retrieval and re-ranking provide benefits beyond unstructured augmentation.

Revision: partial

Referee: [Results table, action-surfacing numbers] The claim of surfacing up to 50 correct defense actions versus 36 for the next baseline is presented without variance estimates, per-exercise breakdowns, or confirmation that the same set of ground-truth actions was used for all systems, undermining the cross-baseline comparison.

Authors: The 50 versus 36 figures represent the maximum correct actions surfaced across the set of exercises. The revision will include variance estimates (standard deviation across exercises), per-exercise breakdowns in an appendix, and an explicit statement of the evaluation protocol confirming that identical ground-truth action sets, derived from the same annotated traces, were used for every system and baseline.

Revision: yes
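McNemar's test, which the first response names, compares two systems on the same items by looking only at the items where they disagree. A minimal sketch of the exact two-sided version follows. The paired outcomes are hypothetical, and the function names `discordant_counts` and `mcnemar_exact` are assumptions, not code from the paper.

```python
from math import comb


def discordant_counts(baseline_correct, system_correct):
    """Count the items on which the two systems disagree.

    Returns (b, c): b = baseline right and new system wrong,
    c = baseline wrong and new system right.
    """
    b = sum(1 for x, y in zip(baseline_correct, system_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, system_correct) if not x and y)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis each discordant item is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired outcomes on 20 items (True = item answered correctly).
baseline = [True] * 8 + [False] * 9 + [True] * 3
system = [True] * 8 + [True] * 9 + [False] * 3
b, c = discordant_counts(baseline, system)
print(b, c, mcnemar_exact(b, c))
```

Because the test uses only discordant pairs, it requires per-item outcomes for both systems on an identical item set, which ties directly to the referee's request to confirm that the same ground truth was used throughout.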

Circularity Check

0 steps flagged

No circularity: empirical evaluation rests on external baselines and data

The paper describes an empirical system (DEFENGRAPH) that constructs dual-layer KGs from SIEM alerts, topology, attacker behaviors and prior actions collected during cyber-range exercises, then measures LLM recall/precision gains against external baselines on the same held-out exercise data. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described methodology. The reported lifts (e.g., GPT-4o reasoning-recall 61.45% → 73.49%) are presented as direct experimental outcomes, not derived quantities that reduce to the input construction by definition. The framework is therefore self-contained against the external benchmarks it reports.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; no explicit free parameters, new physical entities, or ad-hoc axioms are stated. The work rests on the domain assumption that heterogeneous security artifacts can be assembled into faithful static and dynamic graphs that improve LLM outputs.

axioms (1)

Domain assumption: Heterogeneous security artifacts (SIEM alerts, topology, attacker behaviors, prior actions) can be assembled into static and dynamic knowledge graphs that faithfully capture both long-term domain knowledge and evolving event context.
Central premise invoked to justify the dual-layer KG construction and its use for grounding LLM outputs.

[1] [1]

Stop overthinking: A survey on efficient reasoning for large language models,

Y . Sui, Y .-N. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, H. Chen, and X. Hu, “Stop overthinking: A survey on efficient reasoning for large language models,” 2025. [Online]. Available: https://arxiv.org/abs/2503.16419

Pith/arXiv arXiv 2025

[2] [2]

Sok: Semantic privacy in large language models,

B. Ma, Y . Jiang, X. Wang, G. Yu, Q. Wang, C. Sun, C. Li, X. Qi, Y . He, W. Ni, and R. P. Liu, “Sok: Semantic privacy in large language models,” 2025. [Online]. Available: https://arxiv.org/abs/2506.23603

arXiv 2025

[3] [3]

When llms meet cybersecurity: a systematic literature review,

J. Zhang, H. Bu, H. Wen, Y . Liu, H. Fei, R. Xi, L. Li, Y . Yang, H. Zhu, and D. Meng, “When llms meet cybersecurity: a systematic literature review,”Cybersecurity, vol. 8, no. 1, p. 55, Feb 2025. [Online]. Available: https://doi.org/10.1186/s42400-025-00361-w

work page doi:10.1186/s42400-025-00361-w 2025

[4] [4]

ACM Trans

H. Xu, S. Wang, N. Li, K. Wang, Y . Zhao, K. Chen, T. Yu, Y . Liu, and H. Wang, “Large language models for cyber security: A systematic literature review,”ACM Trans. Softw. Eng. Methodol., Sep. 2025, just Accepted. [Online]. Available: https://doi.org/10.1145/3769676

work page doi:10.1145/3769676 2025

[5] [5]

Accountability and reliability in 6g o-ran: Who is responsible when it fails?

Y . He, G. Yu, X. Wang, Q. Wang, Z. Niu, W. Ni, and R. P. Liu, “Accountability and reliability in 6g o-ran: Who is responsible when it fails?”IEEE Wireless Communications, vol. 32, no. 2, pp. 52–59, 2025

2025

[6] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and T. Liu, “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” ACM Trans. Inf. Syst., vol. 43, no. 2, Jan. 2025. [Online]. Available: https://doi.org/10.1145/3703155

[7] D. R. Arikkat, A. M., N. Binu, P. M., N. Biju, K. S. Arunima, V. P., R. R. K. A., and M. Conti, “Intellbot: Retrieval augmented llm chatbot for cyber threat knowledge delivery,” 2024. [Online]. Available: https://arxiv.org/abs/2411.05442

[8] S. Moiz, A. Majid, A. Basit, M. Ebrahim, A. A. Abro, and M. Naeem, “Security and threat detection through cloud-based wazuh deployment,” in 2024 IEEE 1st Karachi Section Humanitarian Technology Conference (KHI-HTC), 2024, pp. 1–5.

[9] Ismail, R. Kurnia, F. Widyatama, I. M. Wibawa, Z. A. Brata, Ukasyah, G. A. Nelistiani, and H. Kim, “Enhancing security operations center: Wazuh security event response with retrieval-augmented-generation-driven copilot,” Sensors, vol. 25, no. 3, 2025. [Online]. Available: https://www.mdpi.com/1424-8220/25/3/870

[10] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih, “Dense passage retrieval for open-domain question answering,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for Computational Linguistics, Nov. 2020, pp. 6...

[11] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, “A survey on RAG meeting LLMs: Towards retrieval-augmented large language models,” in ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2024.

[12] Y. Li, M. Cai, J. Chen, Y. Xu, L. Huang, and J. Li, “Context-aware prompting for llm-based program repair,” Automated Software Engineering (ASE), 2025.

[13] S. An, Z. Ma, Z. Lin, N. Zheng, J.-G. Lou, and W. Chen, “Make your LLM fully utilize the context,” Advances in Neural Information Processing Systems (NeurIPS), 2024.

[14] J. S. Garrido, D. Dold, and J. Frank, “Machine learning on knowledge graphs for context-aware security monitoring,” in IEEE International Conference on Cyber Security and Resilience (CSR), 2021.

[15] D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, D. Metropolitansky, R. O. Ness, and J. Larson, “From local to global: A graph rag approach to query-focused summarization,” arXiv preprint arXiv:2404.16130, 2024.

[16] J. Zhang, X. Zhang, J. Yu, J. Tang, J. Tang, C. Li, and H. Chen, “Subgraph retrieval enhanced model for multi-hop knowledge base question answering,” in Annual Meeting of the Association for Computational Linguistics (ACL), 2022.

[17] OpenAI, “GPT-4 technical report,” CoRR, vol. abs/2303.08774, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.08774

[18] A. Grattafiori, A. Dubey, A. Jauhri et al., “The llama 3 herd of models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

[19] DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2501.12948

[20] A. Yang, A. Li, B. Yang, et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.

[21] Y. Yuan, C. Liu, J. Yuan, G. Sun, S. Li, and M. Zhang, “A hybrid RAG system with comprehensive enhancement on complex reasoning,” arXiv preprint arXiv:2408.05141, 2024.

[22] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd, “spaCy: Industrial-strength Natural Language Processing in Python,” https://spacy.io/, 2020.

[23] T. Nguyen, L. Luo, F. Shiri, D. Phung, Y.-F. Li, T. Vu, and G. Haffari, “Direct evaluation of chain-of-thought in multi-hop reasoning with knowledge graphs,” in Association for Computational Linguistics (ACL), 2024.

[24] M. Li, S. Miao, and P. Li, “Simple is effective: The roles of graphs and large language models in knowledge-graph-based retrieval-augmented generation,” in International Conference on Learning Representations, 2025.

[25] G. He, Y. Lan, J. Jiang, W. X. Zhao, and J.-R. Wen, “Improving multi-hop knowledge base question answering by learning intermediate supervision signals,” ACM International Conference on Web Search and Data Mining (WSDM), 2021.

[26] Z. Sun, Z. Si, X. Zang, K. Zheng, Y. Song, X. Zhang, and J. Xu, “Large language models enhanced collaborative filtering,” in ACM International Conference on Information and Knowledge Management (CIKM), 2024.

[27] W. Tan, B. Li, C. Jin, W. Huang, X. Wang, and R. Song, “Think-then-react: Towards unconstrained human action-to-reaction generation,” [Online]. Available: https://arxiv.org/abs/2503.16451

[29] G. Agrawal, K. Pal, Y. Deng, H. Liu, and Y.-C. Chen, “Cyberq: Generating questions and answers for cybersecurity education using knowledge graph-augmented LLMs,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024.

[30] Z. Zhu, R. Jiang, Y. Jia, J. Xu, and A. Li, “Cyber security knowledge graph based cyber attack attribution framework for space-ground integration information network,” in IEEE International Conference on Communication Technology (ICCT), 2018.

[31] Y. Ren, Y. Xiao, Y. Zhou, Z. Zhang, and Z. Tian, “Cskg4apt: A cybersecurity knowledge graph for advanced persistent threat organization attribution,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 6, pp. 5695–5709, 2023.

[32] Y. Cheng, O. Bajaber, S. A. Tsegai, D. Song, and P. Gao, “Ctinexus: Automatic cyber threat intelligence knowledge graph construction using large language models,” 2025. [Online]. Available: https://arxiv.org/abs/2410.21060

[33] Y. Zhang, T. Du, Y. Ma, X. Wang, Y. Xie, G. Yang, Y. Lu, and E.-C. Chang, “Attackg+: Boosting attack graph construction with large language models,” Computers & Security, vol. 150, p. 104220, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404824005261

[34] R. Fieblinger, M. T. Alam, and N. Rastogi, “Actionable cyber threat intelligence using knowledge graphs and large language models,” in IEEE European symposium on security and privacy workshops (EuroS&PW), 2024.

[35] M. Hua, Y. Zhang, Q. Zhang, H. Tang, L. Guo, Y. Lin, H. Sari, and G. Gui, “Kg-ibl: Knowledge graph driven incremental broad learning for few-shot specific emitter identification,” IEEE Transactions on Information Forensics and Security, vol. 19, pp. 10 016–10 028, 2024.

[36] C. Wang and H. Zhu, “Wrongdoing monitor: A graph-based behavioral anomaly detection in cyber security,” IEEE Transactions on Information Forensics and Security, vol. 17, pp. 2703–2718, 2022.

[37] M. Wang, N. Yang, and N. Weng, “K-getnid: Knowledge-guided graphs for early and transferable network intrusion detection,” IEEE Transactions on Information Forensics and Security, vol. 19, pp. 7147–7160, Jan. 2024. [Online]. Available: https://doi.org/10.1109/TIFS.2024.3431932

[38] E. Gilliard, J. Liu, and A. A. Aliyu, “Knowledge graph reasoning for cyber attack detection,” IET Communications, vol. 18, no. 4, pp. 297–308, Feb. 2024. [Online]. Available: https://doi.org/10.1049/cmu2.12736