
arxiv: 2605.16714 · v1 · pith:QMJXUZGWnew · submitted 2026-05-15 · 💻 cs.AI · cs.CR

GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction

Pith reviewed 2026-05-20 17:18 UTC · model grok-4.3

classification 💻 cs.AI cs.CR
keywords security knowledge graphs · cyber threat intelligence · knowledge graph construction · task bank rewards · LLM graph extraction · CTI analysis · reinforcement learning for extraction

The pith

GRID turns CTI text into security knowledge graphs by training on a task bank of multiple-choice questions and regex targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GRID as a way to build computable security knowledge graphs from long cyber threat intelligence (CTI) articles. It first creates traceable alignments between texts and graphs through graph extraction followed by knowledge-graph-conditioned text revision. These alignments become supervision for a task bank that converts the full document-to-graph problem into four-option multi-select questions with triple-level regex matching targets. The model trained with the resulting task-bank rewards reaches the best source-averaged recall and near-top Avg F1 among the compared systems, with lower token usage and deployment cost. The rewards can also be built once offline and reused across later post-training runs.

Core claim

GRID first builds security-domain supervision from CTI articles by creating traceable article-graph alignments through graph extraction and knowledge-graph-conditioned text revision. It then turns document-to-graph learning into a scripted task bank combining four-option multi-select questions with triple-level regex matching targets, yielding more stable task-specific rewards than repeatedly scoring full graph outputs with an LLM judge. The Task-bank Reward model with the ontology-guided GRID extraction pipeline reaches 84.62% source-averaged precision, 64.91% source-averaged recall, and 68.53% Avg F1 on 249 CTI articles from five sources, while the End2End Reward model reaches 76.91% precision, 53.85% recall, and 58.06% Avg F1.

What carries the argument

The Task-bank Reward model trained on a scripted bank of four-option multi-select questions paired with triple-level regex matching targets derived from offline article-graph alignments.
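
A minimal sketch of how such a scripted reward might be scored, assuming exact-match credit for the multi-select answer and a `subject | relation | object` line format for predicted triples. The class names, the scoring rules, the equal weighting, and the Log4Shell example patterns are illustrative assumptions, not details taken from the paper.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ChoiceTask:
    """A four-option multi-select question with a fixed answer key (hypothetical)."""
    question: str
    options: tuple[str, str, str, str]
    correct: frozenset[str]  # subset of {"A", "B", "C", "D"}


@dataclass(frozen=True)
class TripleTask:
    """One target triple, written as a regex over a serialized triple line."""
    pattern: str


def choice_reward(task: ChoiceTask, selected: set[str]) -> float:
    # Assumption: all-or-nothing scoring; partial credit is equally plausible.
    return 1.0 if frozenset(selected) == task.correct else 0.0


def triple_reward(tasks: list[TripleTask], predicted: list[str]) -> float:
    # Fraction of target patterns matched by at least one predicted triple line.
    if not tasks:
        return 0.0
    hits = sum(
        any(re.search(t.pattern, line, flags=re.IGNORECASE) for line in predicted)
        for t in tasks
    )
    return hits / len(tasks)


def task_bank_reward(choice: float, triple: float, weight: float = 0.5) -> float:
    # Assumption: a simple convex combination of the two scripted signals.
    return weight * choice + (1.0 - weight) * triple


question = ChoiceTask(
    question="Which statements about Log4Shell does the article support?",
    options=(
        "It is tracked as CVE-2021-44228",
        "It affects OpenSSL",
        "It targets Log4j",
        "It was disclosed in 2014",
    ),
    correct=frozenset({"A", "C"}),
)
targets = [
    TripleTask(r"log4shell\s*\|\s*(exploits|targets)\s*\|\s*log4j"),
    TripleTask(r"cve-2021-44228\s*\|\s*affects\s*\|\s*log4j"),
]
predicted = ["Log4Shell | exploits | Log4j"]

reward = task_bank_reward(
    choice_reward(question, {"A", "C"}),
    triple_reward(targets, predicted),
)
print(reward)  # 0.75: full credit on the question, one of two triples matched
```

Because both signals depend only on stored answer keys and patterns, nothing in this kind of scoring needs a live LLM judge at training time.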

If this is right

  • Task-bank rewards can be built once offline and reused across later post-training runs.
  • The Task-bank Reward model achieves the best source-averaged recall and near-top Avg F1 with lower token usage and deployment cost.
  • The End2End Reward model with LLM-as-judge rewards reaches 76.91% precision, 53.85% recall, and 58.06% Avg F1.
  • Choice-only Reward and End2End SFT without RL produce weaker results than the task-bank approach.
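
One arithmetic detail is worth checking here. The harmonic mean of the reported 84.62% precision and 64.91% recall is about 73.47%, not 68.53%. That gap is consistent with Avg F1 being a mean of per-source F1 scores rather than an F1 of the averaged precision and recall, though the abstract does not spell out the definition. The sketch below assumes the per-source reading; the source names and per-source numbers are made up for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def source_average(
    per_source: dict[str, tuple[float, float]],
) -> tuple[float, float, float]:
    """Average precision, recall, and F1 over sources, one vote per source."""
    n = len(per_source)
    avg_p = sum(p for p, _ in per_source.values()) / n
    avg_r = sum(r for _, r in per_source.values()) / n
    avg_f1 = sum(f1(p, r) for p, r in per_source.values()) / n
    return avg_p, avg_r, avg_f1


# F1 computed from the reported averaged precision and recall.
print(round(f1(84.62, 64.91), 2))  # 73.47

# Hypothetical per-source (precision, recall) pairs.
demo = {"source_a": (1.0, 0.5), "source_b": (0.5, 1.0)}
avg_p, avg_r, avg_f1 = source_average(demo)
print(avg_p, avg_r, round(avg_f1, 4))  # 0.75 0.75 0.6667
```

In the toy data the averaged precision and recall are both 0.75, yet the mean of per-source F1 is about 0.67, which shows how the two definitions can diverge.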

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Precomputed task banks could be shared across research groups to train security-specific extractors without repeating expensive alignment steps.
  • The same offline-to-online reward pattern might apply to knowledge graph construction in other narrow technical domains such as biomedical literature.
  • If the alignments remain stable, the method could support iterative refinement where new articles update an existing graph without full retraining.

Load-bearing premise

The graph extraction and knowledge-graph-conditioned text revision steps produce sufficiently accurate and unbiased article-graph alignments to serve as reliable supervision targets for the downstream task-bank training.

What would settle it

Running the Task-bank Reward model on a fresh collection of CTI articles outside the original 249 and finding that its source-averaged recall falls below the End2End model's recall would falsify the performance advantage.

Figures

Figures reproduced from arXiv: 2605.16714 by Fei Shao, Liangyi Huang, Mengshi Zhang, Shang Ma, Xusheng Xiao, Yanfang Ye, Zichen Liu, Zihao Chen.

Figure 1: Log4Shell (CVE-2021-44228) CTI articles: knowledge graph (left) and attack …
Figure 2: Overview of GRID

Algorithm 1: Automatic Annotation of Article-Graph Alignment Data
  Input: Raw CTI article a, traceable extraction prompt P_trace, revision prompt P_rev
  Output: Article-graph alignment (a′, G′) consisting of revised article a′ and text-grounded knowledge graph G′
  G ← LLMExtract(a, P_trace)
  Parse G into entity list E and relation list R
  foreach r ∈ R do
      keep sentence-local subject/…
Figure 3: RQ3 ablation curves up to 225 steps. Left: RL training reward (first-order EMA, …
Figure 4: Prompt blocks for the GRID extractor

C Judge Rules for Automatic Evaluation

The automatic evaluator scores precision and recall under an explicit rule set rather than unconstrained semantic similarity …
Original abstract

Security knowledge graphs can provide computable external memory for security agents, but constructing them from long-form cyber threat intelligence (CTI) remains difficult: LLMs often lack grounded security-domain knowledge, and end-to-end document-to-graph training is hard to supervise with cheap, stable rewards. We present GRID (Graph Representation of Intelligence Data), an end-to-end framework for security text knowledge graph construction. GRID first builds security-domain supervision from CTI articles by creating traceable article-graph alignments through graph extraction and knowledge-graph-conditioned text revision. It then turns document-to-graph learning into a scripted task bank combining four-option multi-select questions with triple-level regex matching targets, yielding more stable task-specific rewards than repeatedly scoring full graph outputs with an LLM judge. Using this supervision pipeline, we train two Qwen3-4B-Instruct-2507-based 4B extractors: a primary Task-bank Reward model and a secondary End2End Reward model with LLM-as-judge precision/recall rewards. On 249 CTI articles from GRID, CASIE, CTINexus, MalKG, and SecureNLP, the Task-bank Reward model with the ontology-guided GRID extraction pipeline reaches 84.62% source-averaged precision, 64.91% source-averaged recall, and 68.53% Avg F1, achieving the best source-averaged recall and near-top Avg F1 with lower token usage and deployment cost. The End2End Reward model reaches 76.91% precision, 53.85% recall, and 58.06% Avg F1. Further analyses show that task-bank rewards can be built once offline and reused across later post-training runs, outperforming online End2End LLM-as-judge reward and weaker alternatives such as Choice-only Reward and End2End SFT without RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents GRID, an end-to-end framework for constructing security knowledge graphs from long-form CTI articles. It first generates supervision via graph extraction and knowledge-graph-conditioned text revision to produce traceable article-graph alignments, then converts document-to-graph learning into a task bank of four-option multi-select questions paired with triple-level regex matching. Two Qwen3-4B-based models are trained: a primary Task-bank Reward model and a secondary End2End Reward model using LLM-as-judge rewards. On 249 articles drawn from GRID, CASIE, CTINexus, MalKG, and SecureNLP, the Task-bank model reports 84.62% source-averaged precision, 64.91% source-averaged recall, and 68.53% Avg F1 while using fewer tokens; the End2End model reaches 76.91% precision, 53.85% recall, and 58.06% Avg F1. Additional analyses claim that task-bank rewards can be precomputed offline and reused.

Significance. If the pipeline-generated alignments prove reliable, the work supplies a concrete, lower-cost alternative to repeated LLM-as-judge scoring for supervising security KG extraction. The offline reusability of the task-bank rewards and the reporting of source-averaged metrics across five sources (the paper's own GRID collection plus four external datasets) are practical strengths that could aid reproducibility in the security-AI community.

major comments (1)
  1. [Abstract] Abstract and supervision pipeline description: the central performance numbers (84.62% precision, 64.91% recall) are obtained by training on targets produced by the paper's own graph extraction plus KG-conditioned text revision steps. No human validation, inter-annotator agreement scores, or error analysis of these alignments is reported, so the metrics may largely reflect consistency with the pipeline rather than independent ground-truth quality.
minor comments (2)
  1. The abstract states that the Task-bank model achieves 'lower token usage and deployment cost' but supplies no quantitative token counts, latency figures, or cost comparisons against the baselines.
  2. Details on data splits, baseline implementations, statistical significance testing, and per-source error analysis are absent from the abstract and would strengthen verifiability of the cross-dataset claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the supervision pipeline and its implications for the reported metrics. We address the concern point by point below, acknowledging the limitation while clarifying the design rationale and proposed revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract and supervision pipeline description: the central performance numbers (84.62% precision, 64.91% recall) are obtained by training on targets produced by the paper's own graph extraction plus KG-conditioned text revision steps. No human validation, inter-annotator agreement scores, or error analysis of these alignments is reported, so the metrics may largely reflect consistency with the pipeline rather than independent ground-truth quality.

    Authors: We agree that the performance numbers measure consistency with targets generated by our graph extraction and KG-conditioned text revision pipeline, rather than agreement with independently human-annotated ground truth. This automated approach was chosen to produce scalable, traceable article-graph alignments without the prohibitive cost of manual security KG annotation. The evaluation uses source-averaged metrics across 249 articles from five sources (our own GRID collection plus the external CASIE, CTINexus, MalKG, and SecureNLP datasets) to demonstrate generalization beyond the training sources. We will revise the manuscript to add an explicit limitations subsection discussing the pipeline-generated supervision, include a preliminary error analysis of alignment quality on a held-out sample, and note the absence of inter-annotator agreement as a point for future human-expert validation. These changes will not alter the core claims regarding task-bank reward stability and offline reusability.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity; performance on held-out external datasets

Full rationale

The paper generates article-graph alignments via its graph extraction and KG-conditioned revision pipeline to create supervision targets for task-bank training. It then reports precision/recall/F1 for the trained Task-bank Reward model on 249 held-out CTI articles drawn from external sources (CASIE, CTINexus, MalKG, SecureNLP) plus its own GRID collection. No equations, definitions, or self-citations are shown that reduce the reported source-averaged scores to quantities fitted or generated within the same run. The evaluation therefore retains independent empirical content against external benchmarks rather than collapsing to an in-sample consistency check. This warrants a low circularity score; concerns about unvalidated alignment quality belong to correctness risk, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard assumptions about LLM fine-tuning and the availability of an ontology for guidance; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: LLMs can be improved for structured extraction tasks via reinforcement learning with task-specific rewards derived from question answering and pattern matching
    Invoked when the paper states that the task-bank approach yields more stable rewards than LLM-as-judge scoring.

pith-pipeline@v0.9.0 · 5881 in / 1225 out tokens · 93099 ms · 2026-05-20T17:18:32.239499+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    GRID first constructs security-domain supervision from security-related CTI articles in an unsupervised manner by constructing traceable article-graph alignments through graph extraction and knowledge-graph-conditioned text revision. It then reformulates document-to-graph learning into a scripted task bank that combines four-option multi-select questions with triple-level regex matching targets

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

    Relation between the paper passage and the cited Recognition theorem.

    The ontology is not used as a post-hoc label list. Instead, it guides the LLM to recover entity relations that are not directly stated through explicit verbs or relative clauses, by providing typed entity categories, normalized relation families, aliases, and hierarchy constraints.
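
To make the quoted passage concrete, here is a rough sketch of an ontology object that carries typed entity categories, a type hierarchy, relation aliases, and type constraints on triples. Everything in it is an assumption for illustration: the `Ontology` class, the `ransomware` subtype, the `exploits` relation family, and the alias phrases do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # entity type -> parent type (hierarchy constraint)
    parent: dict[str, str] = field(default_factory=dict)
    # surface relation phrase -> normalized relation family
    relation_alias: dict[str, str] = field(default_factory=dict)
    # permitted (subject type, relation family, object type) combinations
    allowed: set[tuple[str, str, str]] = field(default_factory=set)

    def lineage(self, entity_type: str) -> list[str]:
        """Return the type followed by its ancestors, nearest first."""
        chain = [entity_type]
        # The membership check guards against accidental cycles.
        while chain[-1] in self.parent and self.parent[chain[-1]] not in chain:
            chain.append(self.parent[chain[-1]])
        return chain

    def normalize(self, relation: str) -> str:
        """Map a surface phrase to its relation family, if an alias exists."""
        key = relation.strip().lower()
        return self.relation_alias.get(key, key)

    def permits(self, subj_type: str, relation: str, obj_type: str) -> bool:
        """Check a triple against the constraints, inheriting from ancestors."""
        family = self.normalize(relation)
        return any(
            (s, family, o) in self.allowed
            for s in self.lineage(subj_type)
            for o in self.lineage(obj_type)
        )


onto = Ontology(
    parent={"ransomware": "malware"},
    relation_alias={"takes advantage of": "exploits", "abuses": "exploits"},
    allowed={("malware", "exploits", "vulnerability")},
)
print(onto.permits("ransomware", "Takes advantage of", "vulnerability"))  # True
print(onto.permits("vulnerability", "exploits", "malware"))  # False
```

A structure along these lines could be serialized into an extraction prompt, so that a phrase such as "takes advantage of" maps to one relation family even when the article never uses the canonical verb.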

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Table 6: Entity types in the GRID ontology

user-account, identity, threat-actor-or-intrusion-set, campaign, malware, hacker-tool, general-software, security-product, detailed-part-of-malware-or-hackertool, detailed-part-of-general-software, attack-pattern, vulnerability, file, process, windows-registry-key, course-of-action, url, domain-nam...