pith. machine review for the scientific record.

arxiv: 2510.11541 · v2 · submitted 2025-10-13 · 💻 cs.LG · cs.AI

Question-Adaptive Graph Learning for Multi-hop Retrieval Augmented Generation

Pith reviewed 2026-05-18 07:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multi-hop retrieval · retrieval-augmented generation · graph neural networks · question-adaptive learning · knowledge graphs · synthesized data pre-training · multi-level information modeling

The pith

A question-adaptive graph neural network on multi-level knowledge graphs improves retrieval accuracy for multi-hop questions in RAG systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework for multi-hop retrieval-augmented generation. It first builds a Multi-information Level Knowledge Graph (Multi-L KG) that models the corpus at several information levels (entities, chunks, and documents). It then introduces a Question-Adaptive Graph Neural Network (Quest-GNN, called QSGNN in the paper's figures and code) whose intra- and inter-level message passing is guided by the question itself, so that relevant facts are aggregated and noise is suppressed. Pre-training on data synthesized with two generation strategies further strengthens the learned representations. A sympathetic reader would care because current RAG pipelines frequently miss or mix up the multiple distinct facts required by complex questions.

Core claim

The authors show that question-guided message passing across intra- and inter-level edges on a multi-information-level knowledge graph produces representations that capture complex semantic structure and reduce the effect of irrelevant retrieval noise, with the largest gains appearing after pre-training on synthesized multi-hop data and especially for high-hop questions.

What carries the argument

Quest-GNN, which performs question-guided intra- and inter-level message passing on the Multi-L KG to enable multi-granular aggregation while limiting noise.
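
As an illustration only, question-guided aggregation can be sketched as attention-style weighting, where each neighbor's message is scaled by its similarity to the question embedding. This is a minimal sketch under that assumption, not the authors' Quest-GNN: the function names, the dot-product scoring, and the toy vectors are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_guided_aggregate(question, neighbors):
    """Aggregate neighbor vectors, weighting each by its similarity to the question.

    Assumption: similarity is a plain dot product followed by a softmax.
    The paper may use a different, learned scoring function.
    """
    weights = softmax([dot(question, n) for n in neighbors])
    dim = len(question)
    aggregated = [
        sum(w * n[i] for w, n in zip(weights, neighbors)) for i in range(dim)
    ]
    return aggregated, weights


# Toy example: the first neighbor is aligned with the question,
# the second is orthogonal to it and plays the role of retrieval noise.
question = [1.0, 0.0]
neighbors = [[2.0, 0.0], [0.0, 2.0]]
aggregated, weights = question_guided_aggregate(question, neighbors)
print(weights, aggregated)
```

In this toy setting the aligned neighbor receives most of the weight, which is the noise-suppression effect the review describes; an unguided mean would weight both neighbors equally.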

If this is right

  • Question-guided aggregation reduces noise in multi-target retrieval.
  • Pre-training on synthesized multi-hop examples transfers to real multi-hop scenarios.
  • Performance gains are largest on high-hop questions, reaching 33.8 percent improvement.
  • Multi-granular information is aggregated more effectively than in standard graph or embedding approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same question-guided passing idea could be tested on single-hop or non-retrieval reasoning tasks where noise is also a problem.
  • Alternative graph-construction heuristics that do not rely on fixed information levels might be compared to the Multi-L KG design.
  • Pairing Quest-GNN with existing reranking or query-rewriting modules could produce additive improvements in full RAG pipelines.

Load-bearing premise

The synthesized data generation strategies produce pre-training examples whose distribution matches real multi-hop questions closely enough for the Quest-GNN to learn robust, generalizable representations that transfer to downstream retrieval tasks.

What would settle it

A controlled experiment that measures retrieval accuracy on a set of real high-hop questions when the model is pre-trained on the paper's synthesized data versus trained from scratch or on mismatched synthetic data would test whether the distribution match is necessary for the reported gains.

Figures

Figures reproduced from arXiv: 2510.11541 by Hao Wang, Peiyan Zhang, Weiming Li, Xiaoshuai Hao, Yatao Bian, Yuchen Yan, Zhihua Liu.

Figure 1: Performance on multi-hop questions. Existing methods struggle as the hop number increases, while QSGNN achieves better performance on high-hop questions.
Figure 2: Framework overview. It first constructs the Multi-L KG to model the multi-level relationships within corpora. QSGNN is designed to aggregate information from different levels, and all aggregations are guided by the query. After pre-training and fine-tuning, it can generate representations for multi-hop questions.
Figure 3: Influence of pre-training scale across model dimensions. Pre-training scale ranges from 0 to 150k synthesized QA pairs; information dimension ranges from 32 to 512.
Figure 4: Good case 1 for QSGNN. Query: What county is the city that shares a border with the state capital of the state where Andrew Deveaux was born located in? QSGNN Retrieval: Andrew Deveaux\nAndrew Deveaux (30 April 1758 – 11 July 1812) was an American Loyalist from South Carolina who is most famous for his recapture of the Bahamas in 1783. Charleston, South Carolina\nAlthough the city lost the status of state …
Figure 5: Good case 2 for QSGNN.
Figure 6: Good case 3 for QSGNN.
Figure 7: Bad case 1 for QSGNN. Query: In which province is San Clemente, from the country where Fuser and Alberto meet the indigenous couple who were traveling to look for work? Answer: Talca Province Gold docs: The Motorcycle Diaries (film)\nDuring their expedition, Guevara and Granado encounter the poverty of the indigenous peasants, and the movie assumes a greater seriousness once the men gain a better sense of …
Figure 8: Bad case 2 for QSGNN.
Figure 9: Bad case 3 for QSGNN.
Figure 10: Sentence extraction prompt for OpenIE. Goal: Your task is to extract named entities from the given paragraph. Respond with a JSON list of entities. Example: - Output -: {"named_entities": ["Radio City", "India", "3 July 2001", "Hindi", "English", "May 2008", "PlanetRadiocity.com"] } - Input -: Radio City\nRadio City is India's first private FM radio station and was started on 3 July 2001. It plays Hindi, …
Figure 11: Entity extraction prompt for OpenIE.
Figure 12: Triple extraction prompt for OpenIE.
Figure 13: One hop question generation prompt for OpenIE.
Figure 14: Two hop question generation prompt for OpenIE.
Original abstract

Retrieval-augmented generation (RAG) has demonstrated its ability to enhance Large Language Models (LLMs) by integrating external knowledge sources. However, multi-hop questions, which require the identification of multiple knowledge targets to form a synthesized answer, raise new challenges for RAG systems. Under the multi-hop settings, existing methods often struggle to fully understand the questions with complex semantic structures and are susceptible to irrelevant noise during the retrieval of multiple information targets. To address these limitations, we propose a novel graph representation learning framework for multi-hop question retrieval. We first introduce a Multi-information Level Knowledge Graph (Multi-L KG) to model various information levels for a more comprehensive understanding of multi-hop questions. Based on this, we design a Question-Adaptive Graph Neural Network (Quest-GNN) for representation learning on the Multi-L KG. Quest-GNN employs intra/inter-level message passing mechanisms, and in each message passing the information aggregation is guided by the question, which not only facilitates multi-granular information aggregation but also significantly reduces the impact of noise. To enhance its ability to learn robust representations, we further propose two synthesized data generation strategies for pre-training the Quest-GNN. Extensive experimental results demonstrate the effectiveness of our framework in multi-hop scenarios, especially in high-hop questions the improvement can reach 33.8%. The code is available at: https://github.com/Jerry2398/QSGNN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a Multi-information Level Knowledge Graph (Multi-L KG) that models several information levels to support the understanding of multi-hop questions, and a Question-Adaptive Graph Neural Network (Quest-GNN) that performs intra- and inter-level message passing guided by the input question. Two strategies for synthesizing pre-training data are proposed to improve representation robustness, and the framework is evaluated on multi-hop retrieval tasks with a reported peak improvement of 33.8% on high-hop questions.

Significance. If the reported gains prove robust to baseline choice, statistical testing, and pre-training distribution shift, the work would offer a concrete graph-based mechanism for reducing noise in multi-hop retrieval while preserving multi-granular information. The public release of code at https://github.com/Jerry2398/QSGNN is a positive contribution to reproducibility.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the headline claim of a 33.8% improvement on high-hop questions is presented without naming the strongest baseline, reporting dataset splits, or providing statistical significance, which prevents direct assessment of whether the gain is attributable to the Multi-L KG + Quest-GNN design rather than experimental setup.
  2. [Pre-training / §4] Pre-training subsection: no distributional diagnostics (e.g., hop-count histograms, embedding-space overlap, or KL divergence between synthetic and real multi-hop queries) are supplied to support the assumption that the two synthesized data strategies produce examples whose semantic structure and noise profile transfer to downstream real-world questions; this assumption is load-bearing for the central effectiveness claim.
  3. [Ablation studies] Ablation studies: the contribution of the question-guided aggregation is not isolated from the effect of pre-training data; an ablation that trains Quest-GNN from scratch on the target task (or with random guidance) is needed to establish that the adaptive message-passing mechanism, rather than data artifacts, drives the reported gains.
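
The distributional diagnostic requested in major comment 2 could look roughly like the following: build hop-count histograms for synthetic and real questions and compare them with a KL divergence. This is a hedged sketch, not an analysis from the paper; the helper names, the smoothing constant, and the toy hop counts are assumptions made for illustration.

```python
import math
from collections import Counter


def hop_distribution(hop_counts, support, eps=1e-9):
    """Turn a list of per-question hop counts into a smoothed probability vector.

    `eps` is an assumed smoothing constant that keeps every bin non-zero,
    so the KL divergence below stays finite.
    """
    counts = Counter(hop_counts)
    total = len(hop_counts)
    raw = [counts.get(h, 0) / total + eps for h in support]
    norm = sum(raw)
    return [p / norm for p in raw]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy hop counts, invented for this example (not taken from the paper).
support = [1, 2, 3, 4]
synthetic_hops = [1] * 6 + [2] * 4
real_hops = [2] * 5 + [3] * 3 + [4] * 2

p_synthetic = hop_distribution(synthetic_hops, support)
p_real = hop_distribution(real_hops, support)

print("KL(real || synthetic):", kl_divergence(p_real, p_synthetic))
print("KL(real || real):     ", kl_divergence(p_real, p_real))
```

A large divergence between the real and synthetic histograms would flag exactly the distribution gap the referee is worried about; a value near zero would not by itself prove semantic similarity, since hop count is only one coarse feature.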
minor comments (2)
  1. [Method] Notation for the Multi-L KG levels and the intra/inter-level aggregation functions could be clarified with an explicit diagram or additional equations showing how question embeddings modulate the message-passing weights.
  2. [Figures] Figure captions and axis labels in the experimental plots should explicitly state the metric (e.g., recall@K or exact-match) and the number of runs used for error bars.
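
For reference, the retrieval metric named in minor comment 2 can be computed as below, together with the mean and standard deviation over runs that would back an error bar. The document identifiers and per-run scores are hypothetical, and the sketch assumes the common definition of recall@K as the fraction of gold documents found in the top K results.

```python
import statistics


def recall_at_k(retrieved, gold, k=5):
    """Fraction of gold documents that appear among the top-k retrieved ones."""
    if not gold:
        raise ValueError("gold set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & set(gold)) / len(gold)


# Hypothetical ranking for one multi-hop question with two gold documents.
retrieved = ["d3", "d1", "d7", "d2", "d9", "d4"]
gold = {"d1", "d4"}

r5 = recall_at_k(retrieved, gold, k=5)  # "d4" is ranked sixth, so it is missed
r6 = recall_at_k(retrieved, gold, k=6)

# Hypothetical per-run scores, to show what an error bar would summarize.
run_scores = [0.71, 0.69, 0.73]
mean_score = statistics.mean(run_scores)
std_score = statistics.stdev(run_scores)
print(r5, r6, round(mean_score, 3), round(std_score, 3))
```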

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that additional clarifications and analyses will strengthen the manuscript and address the concerns about robustness and attribution of gains. We respond to each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the headline claim of a 33.8% improvement on high-hop questions is presented without naming the strongest baseline, reporting dataset splits, or providing statistical significance, which prevents direct assessment of whether the gain is attributable to the Multi-L KG + Quest-GNN design rather than experimental setup.

    Authors: We agree that the abstract and experiments would benefit from greater precision to allow direct assessment. In the revised manuscript we will explicitly name the strongest baseline (the best-performing prior method on each dataset), report the exact dataset splits and high-hop question subsets used, and include statistical significance (means and standard deviations over multiple runs together with paired t-test p-values). These additions will make clear that the reported 33.8% improvement on high-hop questions is measured against the strongest available baseline under the same evaluation protocol. revision: yes

  2. Referee: [Pre-training / §4] Pre-training subsection: no distributional diagnostics (e.g., hop-count histograms, embedding-space overlap, or KL divergence between synthetic and real multi-hop queries) are supplied to support the assumption that the two synthesized data strategies produce examples whose semantic structure and noise profile transfer to downstream real-world questions; this assumption is load-bearing for the central effectiveness claim.

    Authors: We acknowledge that explicit distributional diagnostics would provide stronger support for the transferability assumption. While downstream task gains constitute our primary evidence, we will add hop-count histograms comparing the two synthetic pre-training distributions to the real multi-hop queries in the evaluation sets. We will also briefly discuss how the synthesis strategies were designed to preserve semantic structure and noise characteristics. If space permits we will include a simple embedding-space overlap statistic; otherwise the histograms and design rationale will be included in the revised §4. revision: partial

  3. Referee: [Ablation studies] Ablation studies: the contribution of the question-guided aggregation is not isolated from the effect of pre-training data; an ablation that trains Quest-GNN from scratch on the target task (or with random guidance) is needed to establish that the adaptive message-passing mechanism, rather than data artifacts, drives the reported gains.

    Authors: This is a fair request for isolating the source of improvement. We will add two new ablation settings in the revised experiments: (1) Quest-GNN trained from scratch on the target multi-hop retrieval task without any pre-training, and (2) the same architecture using random (non-question-adaptive) guidance during intra- and inter-level message passing. These results will be reported alongside the existing ablations to demonstrate that the question-adaptive aggregation mechanism contributes gains beyond those attributable to the synthesized pre-training data alone. revision: yes
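
The paired t-test promised in the first response can be sketched with the standard library alone. The per-run scores are invented for illustration, and the function only returns the t statistic and degrees of freedom; obtaining the p-value would additionally need a t-distribution routine, for example SciPy's `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched runs.

    Each position in the two lists is assumed to be the same run
    (same seed and split) evaluated with two different methods.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least two runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical recall scores over five matched runs (not from the paper).
proposed = [0.62, 0.58, 0.65, 0.60, 0.63]
baseline = [0.55, 0.54, 0.60, 0.57, 0.56]

t_stat, dof = paired_t_statistic(proposed, baseline)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof}")
```

Pairing by run matters here: it removes run-to-run variance that both methods share, which an unpaired comparison of the two means would leave in.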

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent experimental validation

Full rationale

The paper introduces a Multi-L KG and Quest-GNN with question-adaptive message passing, plus two synthesized data strategies for pre-training, then reports empirical gains on multi-hop retrieval tasks. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the claimed improvements to inputs by construction. The derivation chain consists of architectural design choices and data generation heuristics whose effectiveness is tested externally via experiments rather than being tautological. This qualifies as a self-contained empirical contribution against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The framework rests on the untested premise that question-guided aggregation on a multi-level graph reduces noise more effectively than standard retrieval or GNN baselines, plus the assumption that synthetic pre-training data generalizes.

invented entities (2)
  • Multi-information Level Knowledge Graph (Multi-L KG) · no independent evidence
    purpose: Model various information levels for comprehensive understanding of multi-hop questions
    New structure introduced to capture multi-granular information
  • Question-Adaptive Graph Neural Network (Quest-GNN) · no independent evidence
    purpose: Perform representation learning with intra/inter-level message passing guided by the question
    Core novel component for noise reduction and multi-granular aggregation

pith-pipeline@v0.9.0 · 5800 in / 1173 out tokens · 30350 ms · 2026-05-18T07:08:43.061908+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 13 internal anchors

  1. [1]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

  2. [2]

    Local graph partitioning using pagerank vectors

    Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pagerank vectors. In 2006 47th annual IEEE symposium on foundations of computer science (FOCS’06), pp. 475–486. IEEE,

  3. [3]

    M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 3438–3445, 2020a. Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding...

  4. [4]

    Multi-hop question answering via reasoning chains

    Jifan Chen, Shih-ting Lin, and Greg Durrett. Multi-hop question answering via reasoning chains. arXiv preprint arXiv:1910.02610,

  5. [5]

    DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020b. Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring ...

  6. [6]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130,

  7. [7]

    Hierarchical graph network for multi-hop question answering

    10 Under review as a conference paper at ICLR 2026 Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. Hierarchical graph network for multi-hop question answering.arXiv preprint arXiv:1911.03631,

  8. [8]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2(1),

  9. [9]

    LightRAG: Simple and Fast Retrieval-Augmented Generation

    Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. Lightrag: Simple and fast retrieval-augmented generation. arXiv preprint arXiv:2410.05779,

  10. [10]

    From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

    Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. From rag to memory: Non-parametric continual learning for large language models. arXiv preprint arXiv:2502.14802,

  11. [11]

    Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060,

  12. [12]

    Unsupervised Dense Information Retrieval with Contrastive Learning

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118,

  13. [13]

    Active retrieval augmented generation

    Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7969–7992,

  14. [14]

    NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. arXiv preprint arXiv:2405.17428,

  15. [15]

    Towards General Text Embeddings with Multi-stage Contrastive Learning

    Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281,

  16. [16]

    Bm25s: Orders of magnitude faster lexical search via eager sparse scoring

    Xing Han Lù. Bm25s: Orders of magnitude faster lexical search via eager sparse scoring. arXiv preprint arXiv:2407.03618,

  17. [17]

    Gfm-rag: graph foundation model for retrieval augmented generation

    Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Dinh Phung, Chen Gong, and Shirui Pan. Gfm-rag: graph foundation model for retrieval augmented generation. arXiv preprint arXiv:2502.01113,

  18. [18]

    Faithful chain-of-thought reasoning

    Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. In The 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL 2023),

  19. [19]

    Rearev: Adaptive reasoning for question answering over knowledge graphs

    Costas Mavromatis and George Karypis. Rearev: Adaptive reasoning for question answering over knowledge graphs. arXiv preprint arXiv:2210.13650,

  20. [20]

    Gnn-rag: Graph neural retrieval for large language model reasoning

    Costas Mavromatis and George Karypis. Gnn-rag: Graph neural retrieval for large language model reasoning. arXiv preprint arXiv:2405.20139,

  21. [21]

    Graph retrieval-augmented generation: A survey

    Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. Graph retrieval-augmented generation: A survey. arXiv preprint arXiv:2408.08921,

  22. [22]

    A survey on oversmoothing in graph neural networks

    T Konstantin Rusch, Michael M Bronstein, and Siddhartha Mishra. A survey on oversmoothing in graph neural networks. arXiv preprint arXiv:2303.10993,

  23. [23]

    Dragin: dynamic retrieval augmented generation based on the information needs of large language models

    Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. Dragin: dynamic retrieval augmented generation based on the information needs of large language models. arXiv preprint arXiv:2403.10081,

  24. [24]

    Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. ♪ musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022a. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasonin...

  25. [25]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771,

  26. [26]

    Negative sampling for contrastive representation learning: A review

    Lanling Xu, Jianxun Lian, Wayne Xin Zhao, Ming Gong, Linjun Shou, Daxin Jiang, Xing Xie, and Ji-Rong Wen. Negative sampling for contrastive representation learning: A review. arXiv preprint arXiv:2206.00212,

  27. [27]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600,

  28. [28]

    Qa-gnn: Reasoning with language models and knowledge graphs for question answering

    Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. Qa-gnn: Reasoning with language models and knowledge graphs for question answering. arXiv preprint arXiv:2104.06378,

  29. [29]

    A survey on neural open information extraction: Current status and future directions

    Shaowen Zhou, Bowen Yu, Aixin Sun, Cheng Long, Jingyang Li, Haiyang Yu, Jian Sun, and Yongbin Li. A survey on neural open information extraction: Current status and future directions. arXiv preprint arXiv:2205.11725,

  30. [30]

    Specifically, this appendix is organized as follows

    A APPENDIX This supplementary material provides additional details on the proposed method and experimental results that could not be included in the main manuscript due to page limitations. Specifically, this appendix is organized as follows. • Sec. A.1 provides the use of Large Language Models (LLMs). • ...

  31. [31]

    The BM25 retrieval algorithm is implemented as BM25S (Lù, 2024)

    Table 7: Dataset statistics

                            MuSiQue     2Wiki   HotpotQA
    #Entity                 118,021    53,153     86,147
    #Chunk                   57,887    23,023     39,830
    #Document                15,803     7,403      9,811
    #Pre-train 1-hop QA      91,621    52,122     73,128
    #Pre-train 2-hop QA      58,923    12,097     20,619
    #Pre-train QA           150,544    64,219     93,747
    #Fine-tune 2-hop QA         270       782       1000
    #Fine-tune 3-hop ...

  32. [32]

    As for the text embedding based model, we use the NV-Embed-v2-7B and GTE-Qwen2-7B-Instruct from Huggingface (Wolf et al., 2019)

    is implemented with the official code acting as the retrieval server in IRCoT (Trivedi et al., 2022b). As for the text embedding based model, we use the NV-Embed-v2-7B and GTE-Qwen2-7B-Instruct from Huggingface (Wolf et al., 2019). For all the GNN based methods and graph search based methods we use their official implementations. Experimental Setting...

  33. [33]

    iv) For GFM-RAG, we use the official model implementation where we pre-train and fine-tune the GNN on our corpora and all the hyperparameters are set as (Luo et al.,

    iii) For text embedding based methods, embeddings are calculated for all the documents within our datasets, and we retrieve the 5 most relevant documents as (Jimenez Gutierrez et al., 2024). iv) For GFM-RAG, we use the official model implementation where we pre-train and fine-tune the GNN on our corpora and all the hyperparameters are set as (Luo et al.,

  34. [34]

    vi) For GraphRAG and LightRAG, the implementations are based on the official codes, and the hyperparameters are set as HippoRAG2 (Gutiérrez et al.,

    on our datasets, all the settings are the same as the official settings (Mavromatis & Karypis, 2024). vi) For GraphRAG and LightRAG, the implementations are based on the official codes, and the hyperparameters are set as HippoRAG2 (Gutiérrez et al.,

  35. [35]

    A.4 QA PERFORMANCE ON GPT-4O-MINI Table 8: QA performance on GPT-4o-mini

    and GPT-4o-mini (OpenAI., 2024)) for QA task. A.4 QA PERFORMANCE ON GPT-4O-MINI Table 8: QA performance on GPT-4o-mini. Average means the average performance across all the datasets. We highlight the best results with bold and the second best results with underline. MuSiQue 2Wiki HotpotQA Average Method EM...

  36. [36]

    The best result is shown in bold

    A.5 MULTI-HOP PERFORMANCE ON 2WIKI
    Table 9: Performance of different hop numbers on 2Wiki. The best result is shown in bold.

                   2Wiki(Recall@5)    2Wiki(F1)          2Wiki(EM)
    Method         2-hop    4-hop     2-hop    4-hop     2-hop    4-hop
    NV-Embed-v2    81.40    63.52     63.21    42.19     59.28    23.35
    GFM-RAG        83.37    56.09     64.36    33.11     54.44    18.82
    RAPTOR         85.91    64.75     66.13    40.14     53.71    20.12
    GraphRAG       -        -         67.99    39...

  37. [37]

    chunk” means performing QA task with chunk. “w/o inter

    As for the chunk retrieval (QSGNN + chunk, w/o inter or w/o doc), we retrieve the top 10 most relevant chunks for QA and we only report the EM and F1 score since we have no ground truth for chunk retrieval. We find that QSGNN(chunk) does not perform as well as QSGNN. It can be attributed to two factors: i) QSGNN is not directly trained on chunk labels. ii...

  38. [38]

    The receptive field limits the potential of QSGNN

    We find that 1-layer QSGNN (one intra-level + one inter-level) have limited performance and the gap between 2-layer QSGNN becomes bigger as the hop number increase, it is because 1-layer QSGNN can only aggregate 2-hop information. The receptive field limits the potential of QSGNN. The 2-layer QSGNN achieves the best performance among 2,3,4 hop questions b...

  39. [39]

    16 Under review as a conference paper at ICLR 2026 Table 12: The influence of negative sampling number

    shows limited benefits, because excessive easy negatives fail to provide meaningful guidance for QSGNN training. Table 12: The influence of negative sampling number. MuSiQue(Recall@5) MuSiQue(F1) MuSiQue(EM) Method 2-hop 3-hop 4-hop 2-hop 3-hop 4-hop 2-hop 3-hop 4-hop HippoRAG2 79.89 73.96 48.32 53.01 44....

  40. [40]

    As for the pre-training scale, the results show that insufficient pre-training data leads to sub-optimal performance across all the dimensions

    We conduct experiments on the MuSiQue dataset. As for the pre-training scale, the results show that insufficient pre-training data leads to sub-optimal performance across all the dimensions. As the amount of pre-training data increases, the performance of QSGNN gets better, although the marginal improvement decreases. As for the information dimension, the...

  41. [41]

    born in South Carolina

    Figure 5 shows QSGNN achieving perfect recall@5, while the retrieval of HippoRAG2 is inaccurate. For the 4-hop question about Andrew Deveaux’s birthplace, QSGNN integrates evidence across multiple documents, each of which describes one key piece of evidence for the question: i) biographical data (“born in South Carolina”), ii) state capital history (“Columbia became ...

  42. [42]

    Vilaiyaadu Mankatha\

    Another version says that it was named by Juan Crespí on account of a pair of springs, the Kuruvungna Springs (Serra Springs), that were reminiscent of the tears that Saint Monica shed over her son's early impiety. Answer :August 3, 1769 HippoRAG2 Retrieval : Vilaiyaadu Mankatha\nFour songs were included as bonus tracks to the single release of \"Vilaiyaa...

  43. [43]

    Rio de Orellana.\

    was a Spanish explorer and conquistador. He completed the first known navigation of the entire length of the Amazon River, which initially was named \"Rio de Orellana.\" He also founded the city of Guayaquil in what is now Ecuador. Jive Records\nJive Records was an American record label under the RCA Music Group formed in 1981 by Zomba Records. Formerly h...

  44. [44]

    The invention of the cotton gin in 1793 revolutionized the processing of this crop, making short-staple cotton profitable

    Charleston, South Carolina\nAlthough the city lost the status of state capital to Columbia in 1786, Charleston became even more prosperous in the plantation-dominated economy of the post-Revolutionary years. The invention of the cotton gin in 1793 revolutionized the processing of this crop, making short-staple cotton profitable. It was more easily grown ...

  45. [45]

    The city serves as the county seat of Richland County, and a portion of the city extends into neighboring Lexington County. It is the center of the Columbia metropolitan statistical area, which had a population of 767,598 as of the 2010 United States Census, growing to 817,488 by July 1, 2016, according to 2015 U.S. Census estimates. The name Columbia is ...

  46. [46]

    Freikorps

    Figure 7 demonstrates QSGNN’s failure to retrieve documents containing “Freikorps” due to a misspelled query term (“free crops” should be “Freikorps”; after we changed “free crops” to “Freikorps”, QSGNN retrieved the documents correctly). This spelling error caused QSGNN to prioritize non-critical entities like “democratic government” and “Germany”. We al...

  47. [47]

    les députés protestataires\

    The movement started with the first election for the Reichstag; those elected were called "les députés protestataires\", and until the fall of Bismarck in 1890, they were the only deputies elected by the Alsatians to the German parliament demanding the return of those territories to France. At the last Reichstag election in Strasbourg and its periphery, t...

  48. [48]

    In Germany, the revolt is often called People's Uprising in East Germany (Volksaufstand in der DDR)

    It turned into a widespread uprising against the German Democratic Republic government the next day. In Germany, the revolt is often called People's Uprising in East Germany (Volksaufstand in der DDR). It involved more than one million people in about 700 localities. 17 June was declared a day of national remembrance in West Germany up until reunification...

  49. [49]

    The commanders - in - chief exercised supreme authority in their respective zones and acted in concert on questions affecting the whole country

    History of Germany (1945–1990)\nThe intended governing body of Germany was called the Allied Control Council. The commanders - in - chief exercised supreme authority in their respective zones and acted in concert on questions affecting the whole country. Berlin, which lay in the Soviet (eastern) sector, was also divided into four sectors with the Western ...

  50. [50]

    Gaboto\" or \

    Between 1919 and 1933 there was no single name for the new state that gained widespread acceptance, which is precisely why the old name ``Deutsches Reich ''continued in existence even though hardly anyone used it during the Weimar period. To the right of the spectrum the politically engaged rejected the new democratic model and cringed to see the honour o...

  51. [51]

    San Clemente

    Figure 8 reveals QSGNN’s limitation in processing specific terms (“San Clemente”, “Fuser”, “Alberto”), which may not be well represented by QSGNN since they are absent from

    Query: When was the person who Messi's goals in Copa del Rey compared to get signed by Barcelona? QSGNN Retrieval: FC Barcelona\nDespi...

  52. [52]

    A subsequent 5 -- 1 aggregate defeat against Athletic Bilbao in the Supercopa de España ended their expressed hopes of a second sextuple, with Messi scoring his side's only goal

    Lionel Messi\nMessi opened the 2015 -- 16 season by scoring twice from free kicks in Barcelona's 5 -- 4 victory (after extra time) over Sevilla in the UEFA Super Cup. A subsequent 5 -- 1 aggregate defeat against Athletic Bilbao in the Supercopa de España ended their expressed hopes of a second sextuple, with Messi scoring his side's only goal. On 16 Septe...

  53. [53]

    province

    Now playing in all competitions, he befriended his teammates, among whom were Cesc Fàbregas and Gerard Piqué. After completing his growth hormone treatment aged 14, Messi became an integral part of the ``Baby Dream Team '', Barcelona's greatest - ever youth side. During his first full season (2002 -- 03), he was top scorer with 36 goals in 30 games for th...

  54. [54]

    Messi” and “Barcelona

    Figure 9 shows a typical bad case for QSGNN. In this case, terms like “Messi” and “Barcelona” dwarf the key evidence “Diego Maradona, who Messi can bring comparison to”. The sequential dependency between finding “Diego Maradona” and the subsequent evidence (“June 1982 transfer record”) further complicates retrieval. It is difficult for query alignment to id...

  55. [55]

    sentences

    through query decomposition and iterative retrieval may mitigate the problem. However, the first solution is a tricky strategy and may even compromise performance, since subgraph sampling leads to information loss. The second solution introduces CoT into the framework, which is beyond the design scope of QSGNN. We will leave the combination of these tw...

  56. [56]

    named_entities

    It plays Hindi, English and regional songs. Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features. Sentence Extraction Prompt Figure 10: Sentence extraction prompt for OpenIE. Goal: Your task is to extract named entities ...

  57. [57]

    It plays Hindi, English and regional songs. Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features. Entity Extraction Prompt Figure 11: Entity extraction prompt for OpenIE. A.11 PROMPTS FOR SYNTHESIZED PRE-TRAINING DATA The p...

  58. [58]

    triples": [ [

    Goal: Example: - Output -: {"triples": [ ["Radio City", "located in", "India"], ["Radio City", "is", "private FM radio station"], ["Radio City", "started on", "3 July 2001"], ["Radio City", "plays songs in", "Hindi"], ["Radio City", "plays songs in", "English"], ["Radio City", "forayed into", "New Media"]...
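As a rough illustration of how an OpenIE output in the `{"triples": [...]}` shape shown above could be turned into graph edges, the sketch below parses the JSON and groups triples by head entity. The helper `triples_to_adjacency` is hypothetical and not part of the paper's pipeline; skipping malformed triples is an assumption, and the example uses only two of the triples listed above.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple


def triples_to_adjacency(raw_output: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each head entity to its (relation, tail) pairs from a JSON triple list."""
    payload = json.loads(raw_output)
    adjacency: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for triple in payload.get("triples", []):
        if len(triple) != 3:
            continue  # assumption: silently skip triples that are not [head, relation, tail]
        head, relation, tail = triple
        adjacency[head].append((relation, tail))
    return dict(adjacency)


example_output = (
    '{"triples": [["Radio City", "located in", "India"], '
    '["Radio City", "plays songs in", "Hindi"]]}'
)

for head, edges in triples_to_adjacency(example_output).items():
    print(head, edges)
```

Grouping by head entity keeps the structure close to an adjacency list, which is a convenient starting point for building entity-level nodes and edges.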

  59. [59]

    Radio City

    It plays Hindi, English and regional songs. Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features. Named_entities: ["Radio City", "India", "3 July 2001", "Hindi", "English", "May 2008", "PlanetRadiocity.com"] Triple Ext...

  60. [60]

    Radio City

    It plays Hindi, English and regional songs. Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features. Named_entities: ["Radio City", "India", "3 July 2001", "Hindi", "English", "May 2008", "PlanetRadiocity.com"] One-hop Que...

  61. [61]

    question-answer-doc triples

    Entity list two: ['December 2012', 'May 2013', 'Lionel Messi', 'Bayern Munich', 'Champions League', 'Copa del Rey', 'FC Barcelona', 'Pep Guardiola', 'Spanish', 'Tito Vilanova', 'Real Madrid', 'July'] Common entity list: ['Real Madrid', 'Champions League', 'Copa del Rey', 'FC Barcelona'] - Output -: {"question-answer-doc triples": [ { "question": "Which co...