arxiv: 2603.21440 · v4 · submitted 2026-03-22 · 💻 cs.CL · cs.AI

Recognition: no theorem link

KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning

Shuai Wang , Yinan Yu

Authors on Pith no claims yet

Pith reviewed 2026-05-15 06:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords knowledge graph reasoningreinforcement learningmulti-hop reasoningknowledge base question answeringlarge language modelsunified reasoningcompact modelsbacktracking

0 comments

The pith

KG-Hopper trains a 7B open LLM via RL to embed full multi-hop KG traversal and backtracking into one unified thinking stage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents KG-Hopper, a reinforcement learning framework that trains compact open large language models to perform multi-hop reasoning over knowledge graphs in a single inference pass. Rather than following predefined sequential pipelines that isolate each step and propagate errors, the method folds the entire traversal, decision process, and backtracking into one global thinking stage. This lets the model optimize over cross-step dependencies at once. Experiments across eight KG reasoning benchmarks show the resulting 7B model surpassing multi-step systems with up to 70B parameters and matching proprietary models such as GPT-3.5-Turbo and GPT-4o-mini while remaining open and data-efficient.

Core claim

KG-Hopper is a reinforcement learning framework that empowers compact open LLMs to perform integrated multi-hop KG reasoning within a single inference round by training a Reasoning LLM that embeds the entire KG traversal and decision process into a unified thinking stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking.

What carries the argument

Unified thinking stage produced by RL training that integrates full KG traversal, decisions, and backtracking into one inference pass.

Load-bearing premise

Reinforcement learning can embed the complete KG traversal, decision logic, and backtracking into a single unified thinking stage without sequential error cascades.

What would settle it

A new benchmark with deeper cross-step dependencies where the 7B KG-Hopper model falls below the accuracy of a tuned 70B sequential baseline would falsify the unified-stage advantage.

Figures

Figures reproduced from arXiv: 2603.21440 by Shuai Wang, Yinan Yu.

**Figure 2.** Figure 2: The RL training process under two settings: with and without history resampling ( [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs) exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified ``thinking'' stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG-Hopper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes KG-Hopper, an RL framework that trains a 7B-parameter open LLM to embed entire multi-hop KG traversals, decisions, and backtracking into a single unified thinking stage rather than sequential pipelines. This is claimed to enable global reasoning over cross-step dependencies without error cascades. On eight KG reasoning benchmarks the 7B model is reported to outperform multi-step systems up to 70B parameters and to reach competitive accuracy with GPT-3.5-Turbo and GPT-4o-mini while remaining compact, open, and data-efficient; public code is released.

Significance. If the central claim holds, the work would demonstrate that RL can induce structured, globally consistent KG reasoning inside a single forward pass of a compact open model, offering a practical route to high-performance KBQA without large-scale models or hand-crafted pipelines. The public code release is a clear reproducibility strength.

major comments (3)

[Methods] Methods section: the reward design, KG serialization format, and mechanism for maintaining or backtracking over global path state inside one generation pass are not described in sufficient detail. Without these, it is impossible to determine whether the reported gains arise from true cross-step reasoning or from local next-hop prediction / path memorization.
[Experiments] Experiments section (results tables): no error bars, statistical significance tests, or ablation on reward components are provided, so the claim that the 7B model “consistently outperforms” 70B baselines cannot be evaluated for robustness.
[§4] §4 (baseline comparisons): it is unclear whether the GPT-3.5/GPT-4o-mini baselines receive identical KG access and serialization or are evaluated zero-shot; this directly affects the interpretation of “competitive performance.”

minor comments (1)

[Abstract / Experiments] The abstract states “eight KG reasoning benchmarks” but does not list them; the experimental section should include an explicit table or appendix enumerating the datasets and their statistics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving clarity and rigor, and we have revised the manuscript to address them directly.

read point-by-point responses

Referee: [Methods] Methods section: the reward design, KG serialization format, and mechanism for maintaining or backtracking over global path state inside one generation pass are not described in sufficient detail. Without these, it is impossible to determine whether the reported gains arise from true cross-step reasoning or from local next-hop prediction / path memorization.

Authors: We agree that the original Methods section lacked sufficient detail. In the revised manuscript we have expanded it to fully specify the reward function (with explicit terms for path accuracy, backtracking penalty, and global consistency), the precise KG serialization format (a structured token sequence of entities and relations), and the single-pass backtracking mechanism (the model emits a unified thinking trace containing conditional backtrack tokens that are evaluated against the full path state within one generation). These additions make clear that performance gains derive from integrated cross-step reasoning rather than local memorization. revision: yes
Referee: [Experiments] Experiments section (results tables): no error bars, statistical significance tests, or ablation on reward components are provided, so the claim that the 7B model “consistently outperforms” 70B baselines cannot be evaluated for robustness.

Authors: We acknowledge the absence of statistical reporting. The revised Experiments section now includes error bars (standard deviation over five independent runs), paired t-test p-values for all comparisons against the 70B baselines, and a new ablation table isolating each reward component. These additions allow direct evaluation of robustness and confirm that the reported gains are statistically significant and attributable to the full reward design. revision: yes
Referee: [§4] §4 (baseline comparisons): it is unclear whether the GPT-3.5/GPT-4o-mini baselines receive identical KG access and serialization or are evaluated zero-shot; this directly affects the interpretation of “competitive performance.”

Authors: We apologize for the ambiguity. All baselines, including GPT-3.5-Turbo and GPT-4o-mini, were given exactly the same KG serialization and access as KG-Hopper; they were not zero-shot. The revised §4 now explicitly states this and includes the prompt templates used for the proprietary models, ensuring the comparison is fair and the competitive performance claim is correctly interpreted. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical RL results on external benchmarks

full rationale

The paper describes an RL training procedure for a 7B LLM to perform unified KG reasoning in a single pass, with all central claims resting on experimental outcomes across eight public benchmarks rather than any closed-form derivation or self-referential equations. No mathematical steps reduce a prediction to a fitted input by construction, no uniqueness theorems are imported via self-citation, and no ansatz is smuggled through prior work. The method is presented as a trainable policy whose success is measured against independent test sets and larger baselines, rendering the reported performance self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions about RL applicability to LLM reasoning and the utility of KGs for multi-hop tasks; no new free parameters, axioms, or invented entities are introduced in the abstract.

axioms (2)

domain assumption Reinforcement learning can be used to train LLMs to improve multi-step reasoning performance
Core premise of the proposed training method.
domain assumption Knowledge graphs provide reliable structured data for evaluating multi-hop reasoning
Foundation of the KBQA benchmarks used.

pith-pipeline@v0.9.0 · 5524 in / 1390 out tokens · 64919 ms · 2026-05-15T06:20:05.419475+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering
cs.AI 2026-05 conditional novelty 6.0

PathISE generates pseudo path-level supervision from answer labels alone via a transformer estimator, distills it to an LLM path generator, and achieves competitive or state-of-the-art KGQA performance on three benchm...
KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning
cs.CL 2026-04 unverdicted novelty 5.0

KG-Reasoner uses reinforcement learning to train LLMs for end-to-end multi-hop knowledge graph reasoning, achieving competitive or better results on eight benchmarks.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 2 Pith papers · 4 internal anchors

[1]

Plugging schema graph into multi-table qa: A human-guided framework for reducing llm reliance,

X. Wang, M. Costa, J. Kovaceva, S. Wang, and F. C. Pereira, “Plugging schema graph into multi-table qa: A human-guided framework for reducing llm reliance,”arXiv preprint arXiv:2506.04427, 2025

work page arXiv 2025
[2]

iQUEST: An iterative question-guided framework for knowledge base question answering,

S. Wang and Y . Yu, “iQUEST: An iterative question-guided framework for knowledge base question answering,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul. 2025, pp. 15 616–15 628. [Online]. Available: https://aclanthology.org/2025.acl-long.760/

work page 2025
[3]

Deeppath: A reinforcement learning method for knowledge graph reasoning,

W. Xiong, T. Hoang, and W. Y . Wang, “Deeppath: A reinforcement learning method for knowledge graph reasoning,” inProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 564–573

work page 2017
[4]

Domagent: Leveraging knowledge graphs and case-based reasoning for domain-specific code generation,

S. Wang, D. Parthasarathy, R. Feldt, and Y . Yu, “Domagent: Leveraging knowledge graphs and case-based reasoning for domain-specific code generation,”arXiv preprint arXiv:2603.21430, 2026

work page arXiv 2026
[5]

Multi-hop knowledge graph reasoning with reward shaping,

X. V . Lin, R. Socher, and C. Xiong, “Multi-hop knowledge graph reasoning with reward shaping,” inProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3243– 3253

work page 2018
[6]

OpenAI o1 System Card

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carneyet al., “Openai o1 system card,”arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Chain-of-thought tokens are computer program variables,

F. Zhu, P. Wang, and Z. Sui, “Chain-of-thought tokens are computer program variables,”arXiv preprint arXiv:2505.04955, 2025

work page arXiv 2025
[9]

Srpo: A cross-domain implemen- tation of large-scale reinforcement learning on llm,

X. Zhang, J. Wang, Z. Cheng, W. Zhuang, Z. Lin, M. Zhang, S. Wang, Y . Cui, C. Wang, J. Penget al., “Srpo: A cross-domain implemen- tation of large-scale reinforcement learning on llm,”arXiv preprint arXiv:2504.14286, 2025

work page arXiv 2025
[10]

Curriculum learning for reinforcement learning domains: A framework and survey,

S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, and P. Stone, “Curriculum learning for reinforcement learning domains: A framework and survey,”Journal of Machine Learning Research, vol. 21, no. 181, pp. 1–50, 2020

work page 2020
[11]

The web as a knowledge-base for answering complex questions,

A. Talmor and J. Berant, “The web as a knowledge-base for answering complex questions,” inProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 641–651

work page 2018
[12]

The value of semantic parse labeling for knowledge base question answering,

W.-t. Yih, M. Richardson, C. Meek, M.-W. Chang, and J. Suh, “The value of semantic parse labeling for knowledge base question answering,” inProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016, pp. 201– 206

work page 2016
[13]

Semantic parsing on freebase from question-answer pairs,

J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic parsing on freebase from question-answer pairs,” inProceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1533–1544

work page 2013
[14]

Beyond iid: three levels of generalization for question answering on knowledge bases,

Y . Gu, S. Kase, M. Vanni, B. Sadler, P. Liang, X. Yan, and Y . Su, “Beyond iid: three levels of generalization for question answering on knowledge bases,” inProceedings of the Web Conference 2021, 2021, pp. 3477–3488

work page 2021
[15]

Qald-9- plus: A multilingual dataset for question answering over dbpedia and wikidata translated by native speakers,

A. Perevalov, D. Diefenbach, R. Usbeck, and A. Both, “Qald-9- plus: A multilingual dataset for question answering over dbpedia and wikidata translated by native speakers,” in2022 IEEE 16th International Conference on Semantic Computing (ICSC). IEEE, 2022, pp. 229–234

work page 2022
[16]

T-rex: A large scale alignment of natural language with knowledge base triples,

H. Elsahar, P. V ougiouklis, A. Remaci, C. Gravier, J. Hare, F. Laforest, and E. Simperl, “T-rex: A large scale alignment of natural language with knowledge base triples,” inProceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018

work page 2018
[17]

Kilt: a benchmark for knowledge intensive language tasks,

F. Petroni, A. Piktus, A. Fan, P. Lewis, M. Yazdani, N. De Cao, J. Thorne, Y . Jernite, V . Karpukhin, J. Maillardet al., “Kilt: a benchmark for knowledge intensive language tasks,” inProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 2523–2544

work page 2021
[18]

Creak: A dataset for commonsense reasoning over entity knowledge,

Y . Onoe, M. J. Zhang, E. Choi, and G. Durrett, “Creak: A dataset for commonsense reasoning over entity knowledge,” inThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

work page
[19]

Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph,

J. Sun, C. Xu, L. Tang, S. Wang, C. Lin, Y . Gong, L. Ni, H.-Y . Shum, and J. Guo, “Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph,” inThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[20]

KG-CoT: Chain- of-thought prompting of large language models over knowledge graphs for knowledge-aware question answering,

R. Zhao, F. Zhao, L. Wang, X. Wang, and G. Xu, “KG-CoT: Chain- of-thought prompting of large language models over knowledge graphs for knowledge-aware question answering,” inProceedings of the Thirty- Third International Joint Conference on Artificial Intelligence (IJCAI-24). International Joint Conferences on Artificial Intelligence, 2024, pp. 6642– 6650

work page 2024
[21]

Interactive-KBQA: Multi-turn inter- actions for knowledge base question answering with large language models,

G. Xiong, J. Bao, and W. Zhao, “Interactive-KBQA: Multi-turn inter- actions for knowledge base question answering with large language models,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V . Srikumar, Eds., Aug. 2024, pp. 10 561–10 582

work page 2024
[22]

Math-shepherd: Verify and reinforce llms step-by-step without human annotations,

P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y . Li, D. Chen, Y . Wu, and Z. Sui, “Math-shepherd: Verify and reinforce llms step-by-step without human annotations,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 9426–9439

work page 2024
[23]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

H. Song, J. Jiang, Y . Min, J. Chen, Z. Chen, W. X. Zhao, L. Fang, and J.-R. Wen, “R1-searcher: Incentivizing the search capability in llms via reinforcement learning,”arXiv preprint arXiv:2503.05592, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Complex question decomposition for semantic parsing,

H. Zhang, J. Cai, J. Xu, and J. Wang, “Complex question decomposition for semantic parsing,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4477–4486

work page 2019
[25]

ArcaneQA: Dynamic program induction and contextu- alized encoding for knowledge base question answering,

Y . Gu and Y . Su, “ArcaneQA: Dynamic program induction and contextu- alized encoding for knowledge base question answering,” inProceedings of the 29th International Conference on Computational Linguistics, N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue,...

work page 2022
[26]

Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning,

R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishna- murthy, A. Smola, and A. McCallum, “Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning,” inInternational Conference on Learning Representations, 2018

work page 2018
[27]

A collaborative reasoning framework powered by reinforcement learning and large language models for complex questions answering over knowledge graph,

Z. Zhang and W. Zhao, “A collaborative reasoning framework powered by reinforcement learning and large language models for complex questions answering over knowledge graph,” inProceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 10 672– 10 684

work page 2025
[28]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

work page 2022
[29]

From System 1 to System 2: A Survey of Reasoning Large Language Models

Z.-Z. Li, D. Zhang, M.-L. Zhang, J. Zhang, Z. Liu, Y . Yao, H. Xu, J. Zheng, P.-J. Wang, X. Chenet al., “From system 1 to system 2: A survey of reasoning large language models,”arXiv preprint arXiv:2502.17419, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Webthinker: Empowering large reasoning models with deep research capability,

X. Li, J. Jin, G. Dong, H. Qian, Y . Zhu, Y . Wu, J.-R. Wen, and Z. Dou, “Webthinker: Empowering large reasoning models with deep research capability,”arXiv preprint arXiv:2504.21776, 2025

work page arXiv 2025