pith. machine review for the scientific record.

arxiv: 2509.03540 · v3 · submitted 2025-08-31 · 💻 cs.CL · cs.AI

Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction

Pith reviewed 2026-05-18 19:23 UTC · model grok-4.3

keywords knowledge graphs · factuality · retrieval-augmented generation · large language models · inference-time methods · question answering · factual consistency

The pith

Dynamically constructing knowledge graphs at inference time improves factual accuracy in large language models by structuring internal and external knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLMs can produce more factually consistent answers by building and refining knowledge graphs during inference rather than relying on unstructured text retrieval. The process begins with prompting the model to extract a seed graph from the question, expands that graph using the model's own internal knowledge, and then selectively pulls in external facts to correct gaps and errors. This structured approach is tested on three factual question-answering benchmarks where it delivers consistent accuracy gains over standard retrieval baselines. A sympathetic reader would care because organizing knowledge into graphs supports better compositional reasoning and limits the sway of irrelevant details on the final output.

Core claim

By extracting a seed knowledge graph from the question via prompting, iteratively expanding it with the LLM's internal knowledge, and then selectively refining it through external retrieval, the framework enhances factual coverage, corrects inaccuracies, and produces more consistent answers than methods that treat knowledge as unstructured text.

What carries the argument

The inference-time knowledge graph that starts as a seed extracted from the question, expands via internal prompting, and receives selective external refinement to organize facts for accurate generation.
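
The three stages described above can be sketched as a small driver loop. This is a minimal illustration and not the authors' implementation: the function name `build_inference_time_kg` and the three injected callables (`extract_seed`, `expand_entity`, `refine_triple`) are hypothetical stand-ins for the LLM prompts and the retrieval step, and the `max_hops` cap is an assumed stopping rule.

```python
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def build_inference_time_kg(
    question: str,
    extract_seed: Callable[[str], Iterable[Triple]],      # hypothetical: LLM prompt on the question
    expand_entity: Callable[[str], Iterable[Triple]],     # hypothetical: LLM prompt on one entity
    refine_triple: Callable[[Triple], Iterable[Triple]],  # hypothetical: retrieval-backed check
    max_hops: int = 2,                                     # assumed stopping rule
) -> Set[Triple]:
    # Stage A: seed graph extracted from the question.
    graph: Set[Triple] = set(extract_seed(question))
    frontier: Set[str] = {e for head, _, tail in graph for e in (head, tail)}
    visited: Set[str] = set()

    # Stage B: breadth-first expansion from the seed entities,
    # one hop per iteration, using only the model's internal knowledge.
    for _ in range(max_hops):
        next_frontier: Set[str] = set()
        for entity in sorted(frontier - visited):
            visited.add(entity)
            for triple in expand_entity(entity):
                if triple not in graph:
                    graph.add(triple)
                    next_frontier.update((triple[0], triple[2]))
        frontier = next_frontier
        if not frontier - visited:
            break  # nothing new to explore

    # Stage C: refinement. By assumption, refine_triple returns the triple
    # unchanged to keep it, a corrected triple to fix it, or several triples
    # to extend it.
    refined: Set[Triple] = set()
    for triple in graph:
        refined.update(refine_triple(triple))
    return refined
```

Passing the model calls in as plain callables keeps the control flow testable with stubs, and the returned set of triples is the structure a reader could inspect to trace which facts informed an answer.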

If this is right

  • Structured graphs reduce the influence of irrelevant information on LLM outputs.
  • Compositional reasoning improves because facts are linked explicitly rather than scattered in text passages.
  • External retrieval becomes more targeted once a partial internal graph already exists.
  • The method operates without retraining and remains scalable at inference time.
  • Interpretability increases because the constructed graph can be inspected to trace which facts informed the answer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-construction loop could be tested on tasks beyond QA such as multi-hop reasoning or long-form generation where structure matters.
  • If the quality of the internally expanded graph can be measured automatically, that metric might serve as an early stopping signal before external retrieval.
  • Hybrid internal-external graphs might generalize to domains where parametric memory alone is insufficient but full retrieval is costly.

Load-bearing premise

Prompting the LLM to extract and iteratively expand a seed knowledge graph produces a structure accurate and complete enough that external retrieval can refine it without introducing new errors or omitting key facts.

What would settle it

Running the method on the same three factual QA benchmarks and finding no gain or a drop in factual accuracy relative to the unstructured retrieval baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2509.03540 by Jinho D. Choi, Kai Shu, Lihui Liu, Shanglin Wu.

Figure 1. Comparison of three methods for answering …
Figure 2. Overview of our pipeline. (A) Graph initialization, in which the input question is parsed by the LLM to extract initial triplets. (B) Graph expansion iteratively explores breadth-first relations from seed entities to build a larger KG. (C) External retrieval, in which search is performed (e.g., using BM25 on the content returned from Wikipedia and Google Search) to correct or extend selected triplets, which are mer…
Figure 3. Accuracy and graph size for five models across different hop counts. Solid lines represent accuracy, while …
Figure 4. Accuracy and recall across different hop …
Abstract

Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) paradigms mitigate this issue by incorporating external knowledge at inference time. However, such methods typically handle knowledge as unstructured text, which reduces retrieval accuracy, hinders compositional reasoning, and amplifies the influence of irrelevant information on the factual consistency of LLM outputs. To overcome these limitations, we propose a novel framework that dynamically constructs and expands knowledge graphs (KGs) during inference, integrating both internal knowledge extracted from LLMs and external knowledge retrieved from external sources. Our method begins by extracting a seed KG from the question via prompting, followed by iterative expansion using the LLM's internal knowledge. The KG is then selectively refined through external retrieval, enhancing factual coverage and correcting inaccuracies. We evaluate our approach on three diverse Factual QA benchmarks, demonstrating consistent gains in factual accuracy over baselines. Our findings reveal that inference-time KG construction is a promising direction for enhancing LLM factuality in a structured, interpretable, and scalable manner.
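
The Figure 2 caption gives BM25 as an example scorer for the external retrieval step. As a rough illustration of what that ranking computes, here is a self-contained Okapi BM25 sketch over pre-tokenized passages. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for the example. The page does not state which BM25 implementation or parameters the paper uses.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str],
    docs: List[List[str]],
    k1: float = 1.5,   # assumed term-frequency saturation parameter
    b: float = 0.75,   # assumed length-normalization parameter
) -> List[float]:
    """Score each tokenized passage in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of passages containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores: List[float] = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average passages are penalized.
        norm = k1 * (1 - b + b * len(d) / avg_len) if avg_len else k1
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores
```

In a pipeline like the one described, the query could be built from a triple being checked (for example its head entity and relation), and the top-scoring passages would be handed back to the model to confirm or correct that triple.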

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an inference-time framework for dynamically constructing and expanding knowledge graphs to improve factual consistency in LLMs. The method extracts a seed KG from the input question via prompting, iteratively expands it using the LLM's internal parametric knowledge, selectively refines the graph through external retrieval, and uses the resulting structure to guide generation. Evaluation on three factual QA benchmarks is reported to yield consistent gains in factual accuracy relative to baselines.

Significance. If the central empirical claims are supported by detailed metrics, ablations, and controls for error accumulation, the work could advance structured alternatives to standard RAG by combining internal and external knowledge in an interpretable graph format. The inference-time KG construction offers potential advantages for compositional reasoning and reduced influence of irrelevant information. The absence of quantitative validation for the expansion and refinement stages currently limits assessment of whether the approach reliably mitigates rather than amplifies hallucinations.

major comments (2)
  1. [Abstract and Evaluation section] The claim of 'consistent gains in factual accuracy over baselines' on three benchmarks provides no details on the precise metrics employed (e.g., exact match, F1, or hallucination rate), baseline implementations, number of runs, or statistical significance testing. This information is load-bearing for the central empirical claim.
  2. [§3 (Proposed Method), iterative expansion step] No triple-level precision/recall against gold facts or ablation removing the expansion loop is reported. Without such evidence it remains unclear whether LLM-driven expansion compounds hallucinations before external refinement can correct them, directly affecting the soundness of the hybrid KG construction claim.
minor comments (2)
  1. [§3] Clarify the exact prompting templates and stopping criteria for seed extraction and iterative expansion to support reproducibility.
  2. [Figures] Ensure all figures showing KG examples include node/edge labels and legend for readability.
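
The triple-level precision and recall requested in the second major comment can be computed with simple set arithmetic once gold triples exist. The sketch below is illustrative only: `triple_precision_recall` is a hypothetical helper, and matching by lowercased, whitespace-stripped exact strings is an assumption. A real evaluation would likely need entity and relation alignment.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def _normalize(triple: Triple) -> Triple:
    # Assumed matching rule: case-insensitive exact match after trimming.
    head, relation, tail = triple
    return (head.strip().lower(), relation.strip().lower(), tail.strip().lower())


def triple_precision_recall(
    predicted: Iterable[Triple], gold: Iterable[Triple]
) -> Tuple[float, float]:
    """Return (precision, recall) of predicted triples against gold triples."""
    pred_set: Set[Triple] = {_normalize(t) for t in predicted}
    gold_set: Set[Triple] = {_normalize(t) for t in gold}
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall
```

Running this once on the graph after expansion and once after refinement would show whether the refinement stage removes more wrong triples than it adds, which is the error-accumulation question the referee raises.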

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving the transparency of our empirical results and the validation of the method components. We address each major comment below and describe the revisions planned for the next version of the manuscript.

  1. Referee: [Abstract and Evaluation section] The claim of 'consistent gains in factual accuracy over baselines' on three benchmarks provides no details on the precise metrics employed (e.g., exact match, F1, or hallucination rate), baseline implementations, number of runs, or statistical significance testing. This information is load-bearing for the central empirical claim.

    Authors: We agree that the abstract and evaluation reporting would benefit from greater specificity to support the central claims. In the revised manuscript we will update the abstract to name the metrics explicitly (Exact Match and token-level F1 for factual accuracy, plus a hallucination rate computed via an external fact-verification model). The Evaluation section will be expanded to describe all baseline implementations (including standard RAG variants and prior KG-augmented methods), report results averaged over three independent runs with standard deviations, and include paired t-test p-values for statistical significance. These additions will be made without altering the existing experimental results. revision: yes

  2. Referee: [§3 (Proposed Method), iterative expansion step] No triple-level precision/recall against gold facts or ablation removing the expansion loop is reported. Without such evidence it remains unclear whether LLM-driven expansion compounds hallucinations before external refinement can correct them, directly affecting the soundness of the hybrid KG construction claim.

    Authors: We acknowledge that triple-level precision and recall against gold facts would offer direct insight into the expansion step. Constructing such gold annotations for dynamically generated triples, however, requires exhaustive manual labeling that exceeds the scope of the current experiments. To directly address the concern about error accumulation, we will add a new ablation in the revised manuscript that disables the iterative expansion loop and reports the resulting factual accuracy on all three benchmarks. We will also insert qualitative examples and discussion in §3 illustrating how the subsequent selective refinement stage corrects inaccuracies introduced during expansion. These changes will provide quantitative evidence on the net contribution of the expansion component. revision: partial
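
The first response names Exact Match and token-level F1 as the accuracy metrics. A minimal sketch of both follows, assuming SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The page does not say which normalization the paper applies, and the helper names `normalize_answer`, `exact_match`, and `token_f1` are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    # Assumed SQuAD-style normalization; the paper's exact rules are not given.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact Match rewards only fully correct answers, while token-level F1 gives partial credit for overlapping tokens, so reporting both separates near-misses from outright errors.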

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain


The paper describes an inference-time procedure that extracts a seed KG by prompting, expands it internally, then refines it with external retrieval before generation. The central claim is an empirical improvement on three external factual QA benchmarks. No equations, fitted parameters, or self-referential predictions appear in the abstract or method description. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked. The result is therefore not equivalent to its inputs by construction and rests on external benchmark outcomes rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the untested premise that LLMs can reliably perform KG extraction and expansion without introducing factual errors that external retrieval cannot fully correct.

axioms (2)
  • domain assumption: LLMs can extract a seed knowledge graph from a question via prompting that is accurate enough to serve as a foundation for further expansion.
    Invoked in the description of the first step of the method.
  • domain assumption: Iterative expansion using the LLM's internal knowledge improves coverage without proportionally increasing hallucinations.
    Stated as part of the iterative expansion phase.

pith-pipeline@v0.9.0 · 5718 in / 1342 out tokens · 28872 ms · 2026-05-18T19:23:00.714971+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
