Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction
Pith reviewed 2026-05-18 19:23 UTC · model grok-4.3
The pith
Dynamically constructing knowledge graphs at inference time improves factual accuracy in large language models by structuring internal and external knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By extracting a seed knowledge graph from the question via prompting, iteratively expanding it with the LLM's internal knowledge, and then selectively refining it through external retrieval, the framework enhances factual coverage, corrects inaccuracies, and produces more consistent answers than methods that treat knowledge as unstructured text.
What carries the argument
An inference-time knowledge graph: it starts as a seed extracted from the question, expands via internal prompting, and receives selective external refinement, organizing facts for accurate generation.
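The three stages named above (seed extraction, iterative internal expansion, selective external refinement) can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: the function `build_inference_time_kg`, the `max_rounds` cap, the stopping rule (halt when a round adds nothing new), and the toy lookup tables standing in for the LLM and the retriever are all assumptions introduced here.

```python
from typing import Callable, Iterable

Triple = tuple[str, str, str]  # (subject, relation, object)


def build_inference_time_kg(
    question: str,
    extract_seed: Callable[[str], Iterable[Triple]],
    expand: Callable[[set[Triple]], Iterable[Triple]],
    refine: Callable[[set[Triple]], set[Triple]],
    max_rounds: int = 3,  # hypothetical cap; the source gives no stopping criterion
) -> set[Triple]:
    """Seed extraction, then iterative expansion, then external refinement."""
    kg = set(extract_seed(question))
    for _ in range(max_rounds):
        new_triples = set(expand(kg)) - kg
        if not new_triples:  # assumed stopping rule: nothing new was proposed
            break
        kg |= new_triples
    return refine(kg)


# Toy stand-ins for the LLM's parametric memory and an external source.
INTERNAL_MEMORY = {
    "polonium": [("polonium", "discovered_by", "Marie Curie")],
    # Deliberately wrong, to simulate a hallucinated expansion.
    "Marie Curie": [("Marie Curie", "born_in", "Paris")],
}
EXTERNAL_FACTS = {("Marie Curie", "born_in"): "Warsaw"}


def toy_seed(question: str) -> list[Triple]:
    # Stand-in for prompting the LLM to extract a seed graph from the question.
    if "polonium" in question:
        return [("question", "mentions", "polonium")]
    return []


def toy_expand(kg: set[Triple]) -> list[Triple]:
    # Stand-in for prompting the LLM about every entity already in the graph.
    entities = {s for s, _, _ in kg} | {o for _, _, o in kg}
    return [t for e in entities for t in INTERNAL_MEMORY.get(e, [])]


def toy_refine(kg: set[Triple]) -> set[Triple]:
    # Selective refinement: overwrite an object only where external evidence exists.
    return {(s, r, EXTERNAL_FACTS.get((s, r), o)) for s, r, o in kg}


demo_kg = build_inference_time_kg(
    "Where was the discoverer of polonium born?", toy_seed, toy_expand, toy_refine
)
print(sorted(demo_kg))
```

In the toy run, expansion introduces an incorrect birthplace and refinement replaces it, which is the error-correction behavior the core claim depends on. Whether a real LLM-driven loop behaves this way is the open question raised later in the review.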
If this is right
- Structured graphs reduce the influence of irrelevant information on LLM outputs.
- Compositional reasoning improves because facts are linked explicitly rather than scattered in text passages.
- External retrieval becomes more targeted once a partial internal graph already exists.
- The method operates without retraining and remains scalable at inference time.
- Interpretability increases because the constructed graph can be inspected to trace which facts informed the answer.
Where Pith is reading between the lines
- The same graph-construction loop could be tested on tasks beyond QA, such as multi-hop reasoning or long-form generation, where structure matters.
- If the quality of the internally expanded graph can be measured automatically, that metric might serve as an early stopping signal before external retrieval.
- Hybrid internal-external graphs might generalize to domains where parametric memory alone is insufficient but full retrieval is costly.
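The early-stopping idea in the list above could be prototyped with a cheap agreement proxy. The sketch below assumes the expansion step is sampled several times and treats the average rate at which each distinct triple recurs across samples as the quality signal. Both the proxy and the `threshold` value are assumptions made for illustration; the paper does not describe such a metric.

```python
from collections import Counter

Triple = tuple[str, str, str]  # (subject, relation, object)


def agreement_score(samples: list[set[Triple]]) -> float:
    """Average fraction of samples in which each distinct triple appears."""
    counts = Counter(t for sample in samples for t in sample)
    if not counts:
        return 0.0  # no triples at all: treat as lowest confidence
    return sum(counts.values()) / (len(counts) * len(samples))


def needs_retrieval(samples: list[set[Triple]], threshold: float = 0.8) -> bool:
    """Hypothetical gate: call external retrieval only when agreement is low."""
    return agreement_score(samples) < threshold


a = ("Marie Curie", "born_in", "Warsaw")
b = ("Marie Curie", "field", "physics")
c = ("Marie Curie", "field", "chemistry")

print(needs_retrieval([{a, b}, {a, b}, {a, c}]))
print(needs_retrieval([{a, b}, {a, b}, {a, b}]))
```

High agreement does not imply correctness, since a model can be consistently wrong, so a gate like this would need validation against labeled data before it could replace retrieval.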
Load-bearing premise
Prompting the LLM to extract and iteratively expand a seed knowledge graph produces a structure accurate and complete enough that external retrieval can refine it without introducing new errors or omitting key facts.
What would settle it
Running the method on the same three factual QA benchmarks and finding no gain or a drop in factual accuracy relative to the unstructured retrieval baselines would falsify the central claim.
Original abstract
Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) paradigms mitigate this issue by incorporating external knowledge at inference time. However, such methods typically handle knowledge as unstructured text, which reduces retrieval accuracy, hinders compositional reasoning, and amplifies the influence of irrelevant information on the factual consistency of LLM outputs. To overcome these limitations, we propose a novel framework that dynamically constructs and expands knowledge graphs (KGs) during inference, integrating both internal knowledge extracted from LLMs and external knowledge retrieved from external sources. Our method begins by extracting a seed KG from the question via prompting, followed by iterative expansion using the LLM's internal knowledge. The KG is then selectively refined through external retrieval, enhancing factual coverage and correcting inaccuracies. We evaluate our approach on three diverse Factual QA benchmarks, demonstrating consistent gains in factual accuracy over baselines. Our findings reveal that inference-time KG construction is a promising direction for enhancing LLM factuality in a structured, interpretable, and scalable manner.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an inference-time framework for dynamically constructing and expanding knowledge graphs to improve factual consistency in LLMs. The method extracts a seed KG from the input question via prompting, iteratively expands it using the LLM's internal parametric knowledge, selectively refines the graph through external retrieval, and uses the resulting structure to guide generation. Evaluation on three factual QA benchmarks is reported to yield consistent gains in factual accuracy relative to baselines.
Significance. If the central empirical claims are supported by detailed metrics, ablations, and controls for error accumulation, the work could advance structured alternatives to standard RAG by combining internal and external knowledge in an interpretable graph format. The inference-time KG construction offers potential advantages for compositional reasoning and reduced influence of irrelevant information. The absence of quantitative validation for the expansion and refinement stages currently limits assessment of whether the approach reliably mitigates rather than amplifies hallucinations.
major comments (2)
- [Abstract and Evaluation section] The claim of 'consistent gains in factual accuracy over baselines' on three benchmarks provides no details on the precise metrics employed (e.g., exact match, F1, or hallucination rate), baseline implementations, number of runs, or statistical significance testing. This information is load-bearing for the central empirical claim.
- [§3 (Proposed Method)] Iterative expansion step: no triple-level precision/recall against gold facts or ablation removing the expansion loop is reported. Without such evidence it remains unclear whether LLM-driven expansion compounds hallucinations before external refinement can correct them, directly affecting the soundness of the hybrid KG construction claim.
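The triple-level check requested in the second comment amounts to set overlap between generated and gold triples. A minimal sketch follows, assuming exact matching after lowercasing and whitespace trimming. The function names and the normalization are assumptions made here; a real evaluation would likely also need entity and relation alignment, which is not shown.

```python
Triple = tuple[str, str, str]  # (subject, relation, object)


def normalize(triple: Triple) -> Triple:
    """Lowercase and trim each component so trivial variants still match."""
    s, r, o = triple
    return (s.strip().lower(), r.strip().lower(), o.strip().lower())


def triple_precision_recall(
    predicted: list[Triple], gold: list[Triple]
) -> tuple[float, float]:
    """Precision and recall of predicted triples against gold triples."""
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


predicted = [
    ("Polonium", "discovered_by", "Marie Curie"),
    ("Marie Curie", "born_in", "Paris"),  # incorrect expansion
    ("Marie Curie", "born_in", "Warsaw "),
]
gold = [
    ("polonium", "discovered_by", "marie curie"),
    ("marie curie", "born_in", "warsaw"),
    ("marie curie", "spouse", "pierre curie"),
    ("polonium", "named_after", "poland"),
]
print(triple_precision_recall(predicted, gold))
```

Computing the pair once after expansion and once after refinement would show directly whether refinement removes more errors than expansion introduces.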
minor comments (2)
- [§3] Clarify the exact prompting templates and stopping criteria for seed extraction and iterative expansion to support reproducibility.
- [Figures] Ensure all figures showing KG examples include node/edge labels and legend for readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for improving the transparency of our empirical results and the validation of the method components. We address each major comment below and describe the revisions planned for the next version of the manuscript.
- Referee: [Abstract and Evaluation section] The claim of 'consistent gains in factual accuracy over baselines' on three benchmarks provides no details on the precise metrics employed (e.g., exact match, F1, or hallucination rate), baseline implementations, number of runs, or statistical significance testing. This information is load-bearing for the central empirical claim.
  Authors: We agree that the abstract and evaluation reporting would benefit from greater specificity to support the central claims. In the revised manuscript we will update the abstract to name the metrics explicitly (Exact Match and token-level F1 for factual accuracy, plus a hallucination rate computed via an external fact-verification model). The Evaluation section will be expanded to describe all baseline implementations (including standard RAG variants and prior KG-augmented methods), report results averaged over three independent runs with standard deviations, and include paired t-test p-values for statistical significance. These additions will be made without altering the existing experimental results.
  Revision: yes
- Referee: [§3 (Proposed Method)] Iterative expansion step: no triple-level precision/recall against gold facts or ablation removing the expansion loop is reported. Without such evidence it remains unclear whether LLM-driven expansion compounds hallucinations before external refinement can correct them, directly affecting the soundness of the hybrid KG construction claim.
  Authors: We acknowledge that triple-level precision and recall against gold facts would offer direct insight into the expansion step. Constructing such gold annotations for dynamically generated triples, however, requires exhaustive manual labeling that exceeds the scope of the current experiments. To directly address the concern about error accumulation, we will add a new ablation in the revised manuscript that disables the iterative expansion loop and reports the resulting factual accuracy on all three benchmarks. We will also insert qualitative examples and discussion in §3 illustrating how the subsequent selective refinement stage corrects inaccuracies introduced during expansion. These changes will provide quantitative evidence on the net contribution of the expansion component.
  Revision: partial
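For reference, the Exact Match and token-level F1 metrics named in the first response are commonly computed along the following lines. This sketch follows the widely used SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). That the authors use this exact normalization is an assumption, since the rebuttal names the metrics without defining them.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Warsaw.", "warsaw"))
print(token_f1("born in Warsaw Poland", "Warsaw"))
```

Per-question scores from functions like these are what would feed the per-run averages, standard deviations, and paired significance tests promised in the response.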
Circularity Check
No circularity in derivation or evaluation chain
The paper describes an inference-time procedure that extracts a seed KG by prompting, expands it internally, then refines it with external retrieval before generation. The central claim is an empirical improvement on three external factual QA benchmarks. No equations, fitted parameters, or self-referential predictions appear in the abstract or method description. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked. The result is therefore not equivalent to its inputs by construction and rests on external benchmark outcomes rather than internal redefinitions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can extract a seed knowledge graph from a question via prompting that is accurate enough to serve as a foundation for further expansion.
- domain assumption: Iterative expansion using the LLM's internal knowledge improves coverage without proportionally increasing hallucinations.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Our method begins by extracting a seed KG from the question via prompting, followed by iterative expansion using the LLM's internal knowledge. The KG is then selectively refined through external retrieval..."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We evaluate our approach on three diverse Factual QA benchmarks, demonstrating consistent gains in factual accuracy over baselines."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.