PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents
Pith reviewed 2026-05-14 19:11 UTC · model grok-4.3
The pith
PAI-2 improves LLM factual accuracy through adaptive graph traversal and planning on knowledge graphs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PAI-2 performs adaptive, iterative information search guided by extracted entities, matched graph vertices, and generated clue-queries within a dynamic multistage pipeline. On Natural Questions, TriviaQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and DiaASQ it outperforms LightRAG, RAPTOR, and HippoRAG 2, delivering a 4 percent average gain by LLM-as-a-Judge. Graph traversal algorithms such as BeamSearch and WaterCircles improve results by 6 percent over standard flatten retrievers, while the search-plan enhancement mechanism supplies an 18 percent boost over the disabled version across the six datasets. PAI-2 also reaches a state-of-the-art 89 percent information-retention score on the MINE-1 benchmark using LLMs from the 7-14B tier.
What carries the argument
The adaptive multistage query processing pipeline that guides iterative graph search through extracted entities, matched vertices, and generated clue-queries.
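The provided text names BeamSearch as one of the traversal algorithms but includes no pseudocode. A minimal beam-search sketch over a knowledge graph, starting from the matched vertices and ranking expansions by a relevance score, could look like the following; the `graph` adjacency format, the `seeds`, and the `score` callable (standing in for clue-query relevance) are all assumptions, not the paper's actual interfaces:

```python
def beam_search_kg(graph, seeds, score, beam_width=3, max_hops=3):
    """Hypothetical beam-search traversal over a knowledge graph.

    graph: dict mapping vertex -> iterable of (relation, neighbor) edges
    seeds: entity vertices matched from the query
    score: callable(path) -> float, e.g. relevance to a generated clue-query
    Returns all paths kept on the beam at any hop.
    """
    beam = [(s,) for s in seeds]          # paths start at matched vertices
    collected = list(beam)
    for _ in range(max_hops):
        candidates = []
        for path in beam:
            for _relation, nxt in graph.get(path[-1], ()):
                if nxt not in path:       # avoid revisiting vertices (cycles)
                    candidates.append(path + (nxt,))
        if not candidates:
            break
        # keep only the top-scoring expansions: the "beam"
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        collected.extend(beam)
    return collected
```

In the pipeline the paper describes, the retained paths would then be verbalized as context for the LLM, and the planner would decide whether another search iteration with fresh clue-queries is needed.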
Load-bearing premise
The reported gains from planning and traversal will generalize beyond the six tested datasets and the specific LLMs used, and LLM-as-a-Judge will measure factual correctness without its own biases.
What would settle it
Evaluating PAI-2 on an additional benchmark or with a different LLM family and finding no gain or a decline in factual correctness scores would falsify the central claim.
Figures
Original abstract
We introduce PersonalAI 2.0 (PAI-2), a novel framework, designed to enhance large language model (LLM) based systems through integration of external knowledge graphs (KG). The proposed approach addresses key limitations of existing Graph Retrieval-Augmented Generation (GraphRAG) methods by incorporating a dynamic, multistage query processing pipeline. The central point of PAI-2 design is its ability to perform adaptive, iterative information search, guided by extracted entities, matched graph vertices and generated clue-queries. Conducted evaluation over six benchmarks (Natural Questions, TriviaQA, HotpotQA, 2WikiMultihopQA, MuSiQue and DiaASQ) demonstrates improvement in factual correctness of generating answers compared to analogues methods (LightRAG, RAPTOR, and HippoRAG 2). PAI-2 achieves 4% average gain by LLM-as-a-Judge across four benchmarks, reflecting its effectiveness in reducing hallucination rates and increasing precision. We show that use of graph traversal algorithms (e.g. BeamSearch, WaterCircles) gain superior results compared to standard flatten retriever on average 6%, while enabled search plan enhancement mechanism gain 18% boost compared to disabled one by LLM-as-a-Judge across six datasets. In addition, ablation study reveals that PAI-2 achieves the SOTA result on MINE-1 benchmark, achieving 89% information-retention score, using LLMs from 7-14B tiers. Collectively, these findings underscore the potential of PAI-2 to serve as a foundational model for next-generation personalized AI applications, requiring scalable, context-aware knowledge representation and reasoning capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PersonalAI 2.0 (PAI-2), a framework integrating external knowledge graphs into LLM systems via a dynamic multistage query pipeline that performs adaptive iterative search using extracted entities, matched vertices, and generated clue-queries. It claims empirical improvements over LightRAG, RAPTOR, and HippoRAG 2 on six benchmarks (Natural Questions, TriviaQA, HotpotQA, 2WikiMultihopQA, MuSiQue, DiaASQ), including a 4% average gain by LLM-as-a-Judge across four benchmarks, 6% average superiority from graph traversal algorithms (e.g., BeamSearch, WaterCircles) versus flatten retrievers, an 18% boost from the enabled search-plan enhancement mechanism across six datasets, and SOTA 89% information-retention on the MINE-1 benchmark using 7-14B LLMs.
Significance. If the quantitative claims hold under rigorous validation, PAI-2 would offer a concrete advance in GraphRAG by demonstrating the value of planning and traversal mechanisms for reducing hallucinations and improving precision in personalized agents. The ablation results isolating the 18% contribution of the search-plan component and the SOTA result on MINE-1 with modest-sized models constitute reproducible evidence of component-level gains that could inform next-generation context-aware KG systems.
major comments (2)
- [Abstract] Abstract and Evaluation section: The headline performance claims (4% average LLM-as-a-Judge gain on four benchmarks, 6% traversal improvement, 18% search-plan boost) are presented without any description of the experimental protocol, including data splits, judge-model choice, prompt template for the LLM-as-a-Judge, statistical significance tests, error bars, or controls for confounds such as output length or stylistic bias. This absence directly undermines the central assertion that PAI-2 reduces hallucination rates and increases factual precision.
- [Evaluation] Evaluation section: No validation of the LLM-as-a-Judge metric against human judgments, inter-annotator agreement scores, or bias analysis is provided, despite the metric being the sole basis for all reported gains and the claim of superior factual correctness over baselines.
minor comments (1)
- [Abstract] Abstract: 'analogues methods' should read 'analogous methods'.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important gaps in methodological transparency. We will revise the manuscript to incorporate detailed experimental protocols and a validation study for the LLM-as-a-Judge metric, thereby strengthening the presentation of our results.
Point-by-point responses
-
Referee: [Abstract] Abstract and Evaluation section: The headline performance claims (4% average LLM-as-a-Judge gain on four benchmarks, 6% traversal improvement, 18% search-plan boost) are presented without any description of the experimental protocol, including data splits, judge-model choice, prompt template for the LLM-as-a-Judge, statistical significance tests, error bars, or controls for confounds such as output length or stylistic bias. This absence directly undermines the central assertion that PAI-2 reduces hallucination rates and increases factual precision.
Authors: We acknowledge that the abstract and Evaluation section lack explicit descriptions of the experimental protocol. In the revised manuscript, we will expand the Evaluation section to detail the data splits used, the specific judge model and its version, the full prompt template for LLM-as-a-Judge, results from statistical significance tests, error bars on all reported metrics, and controls for confounds including output length and stylistic bias. The abstract will be updated to reference these additions. revision: yes
-
Referee: [Evaluation] Evaluation section: No validation of the LLM-as-a-Judge metric against human judgments, inter-annotator agreement scores, or bias analysis is provided, despite the metric being the sole basis for all reported gains and the claim of superior factual correctness over baselines.
Authors: We agree that direct validation of the LLM-as-a-Judge metric is necessary. We will add a new subsection in the revised Evaluation section reporting a human validation study, including agreement rates between LLM-as-a-Judge scores and human annotations, inter-annotator agreement metrics, and an analysis of potential biases. This will be based on a sampled subset of the benchmark outputs. revision: yes
Circularity Check
No circularity: empirical benchmark gains reported directly from external evaluations
Full rationale
The paper introduces PAI-2 as an engineering framework for KG-enhanced LLM agents and supports its claims exclusively through direct empirical comparisons on six named external benchmarks (Natural Questions, TriviaQA, HotpotQA, etc.). Reported improvements (4% average by LLM-as-a-Judge, 6% from traversal algorithms, 18% from search-plan enhancement) are presented as measured outcomes against baselines such as LightRAG and HippoRAG 2, with an ablation study on MINE-1. No equations, fitted parameters, self-definitional quantities, or predictions derived from internal inputs appear in the provided text; the derivation chain consists of system description followed by independent benchmark scoring rather than any reduction of results to the method's own definitions or prior self-citations.