Autofocus Retrieval: An Effective Pipeline for Multi-Hop Question Answering With Semi-Structured Knowledge

Derian Boer; Stefan Kramer; Stephen Roth

arXiv: 2505.09246 · v4 · pith:LEX2UQYUnew · submitted 2025-05-14 · cs.IR · cs.AI · cs.CL

Pith reviewed 2026-05-22 15:44 UTC · model grok-4.3

classification: cs.IR · cs.AI · cs.CL

keywords: multi-hop question answering · semi-structured knowledge bases · retrieval pipeline · large language models · scope expansion · hybrid retrieval · zero-shot QA · constraint extraction

The pith

A new retrieval pipeline parses questions with language models then expands search scope step by step to answer multi-hop queries over semi-structured knowledge more accurately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a system for answering questions that require chaining facts from both organized data like tables and free text documents connected to them. The approach uses large language models to identify important entities and conditions in the question, then broadens the pool of possible answers gradually while enforcing those conditions before ranking them with similarity measures. Readers might care because many practical questions involve such mixed sources, and better retrieval means more accurate answers with little or no extra training. The method reports higher success in finding the correct answer first compared to existing techniques on three different test sets.

Core claim

AF-Retriever delivers a configurable number of answer candidates in four constraint-driven retrieval steps, which are then supplemented and ranked through four additional processing steps. Like an optical autofocus, it continually adjusts its focus by combining four components: large language models that extract entity attributes and relational constraints, used both for initial parsing and for later reranking; vector similarity search for ranking entities and answers; a novel incremental scope expansion procedure; and a hybrid retrieval strategy that reduces error susceptibility.

What carries the argument

An incremental scope expansion procedure that prepares a configurable number of candidates that best fulfill the given constraints, integrated with large language model extraction for parsing and reranking.
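
To make the idea of incremental scope expansion concrete, here is a minimal sketch. It is not the authors' implementation: the `Candidate` class, the `expand_scope` function, and the rule of lowering the required number of satisfied constraints by one per step are all assumptions chosen for illustration. The sketch starts with the strictest requirement (all constraints satisfied) and relaxes it until at least `k` candidates have been collected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate entity and the constraints it satisfies."""
    entity_id: str
    satisfied: frozenset


def expand_scope(candidates, constraints, k):
    """Collect at least k candidates, relaxing the requirement step by step.

    The first pass requires every constraint to be satisfied. Each later
    pass requires one constraint fewer, so candidates that fulfill more
    constraints always come first in the returned list.
    """
    constraints = frozenset(constraints)
    selected = []
    seen = set()
    for required in range(len(constraints), -1, -1):
        for cand in candidates:
            if cand.entity_id in seen:
                continue
            if len(cand.satisfied & constraints) >= required:
                selected.append(cand)
                seen.add(cand.entity_id)
        if len(selected) >= k:
            break  # enough candidates; stop widening the scope
    return selected


# Toy data (invented for this example).
pool = [
    Candidate("a", frozenset({"c1", "c2", "c3"})),
    Candidate("b", frozenset({"c1", "c2"})),
    Candidate("c", frozenset({"c3"})),
    Candidate("d", frozenset()),
]
result = expand_scope(pool, {"c1", "c2", "c3"}, k=2)
print([c.entity_id for c in result])  # ['a', 'b']
```

In this toy setup `k` plays the role of the configurable candidate volume: a larger `k` forces more relaxation steps and admits candidates that satisfy fewer constraints.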

If this is right

The hybrid strategy makes the overall retrieval less prone to failures in any single component.
Strong results persist in zero-shot and one-shot settings without task-specific fine-tuning.
Ablation studies show how individual elements such as different reranking approaches contribute to the outcome.
The configurable candidate volume lets users balance speed and accuracy for specific needs.
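
One simple way to picture why a hybrid strategy is less fragile is a fallback merge: constraint-based (structural) hits are used first, and similarity-only hits fill the remaining slots. The sketch below is an assumption made for illustration, not the paper's actual procedure; the function name `hybrid_candidates` and its inputs are hypothetical.

```python
def hybrid_candidates(structural_ids, similarity_ranked_ids, k):
    """Return up to k unique ids: structural hits first, then similarity fills.

    If the structural stage returns nothing (for example because the parsed
    constraints were wrong), the result degrades to plain similarity search
    instead of being empty.
    """
    merged = []
    seen = set()
    for entity_id in [*structural_ids, *similarity_ranked_ids]:
        if len(merged) >= k:
            break
        if entity_id not in seen:
            merged.append(entity_id)
            seen.add(entity_id)
    return merged


# Structural stage succeeded: its hits lead the list.
print(hybrid_candidates(["a", "b"], ["b", "c", "d"], k=3))  # ['a', 'b', 'c']
# Structural stage failed: similarity search alone supplies the candidates.
print(hybrid_candidates([], ["x", "y", "z"], k=2))  # ['x', 'y']
```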

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same step-by-step expansion logic could apply to retrieval tasks that do not require multiple hops.
Trying the pipeline on knowledge bases from additional domains would test how well the constraint extraction generalizes.
Improvements in language model accuracy for constraint parsing would likely raise the upper limit on final answer quality.

Load-bearing premise

Large language models can extract entity attributes and relational constraints from questions reliably enough that errors do not degrade the incremental expansion and final ranking steps.

What would settle it

The claim would be undercut if average first-hit rates, measured directly on the STaRK QA benchmarks, showed that the new pipeline does not exceed the second-best method by a substantial margin.

Figures

Figures reproduced from arXiv: 2505.09246 by Derian Boer, Stefan Kramer, Stephen Roth.

**Figure 2.** Hit@20 and recall@20 of AF-Retriever depending on …

**Figure 3.** Hit@20 and recall@20 of AF-Retriever depending on …
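
For readers unfamiliar with the two metrics named in the captions, the following sketch shows their commonly used definitions for a single query. The function names and toy data are illustrative assumptions, and the benchmark's own evaluation code may differ in details such as tie handling.

```python
def hit_at_k(ranked_ids, gold_ids, k):
    """Return 1.0 if any of the top-k ranked ids is a gold answer, else 0.0."""
    return 1.0 if any(r in gold_ids for r in ranked_ids[:k]) else 0.0


def recall_at_k(ranked_ids, gold_ids, k):
    """Return the fraction of gold answers that appear in the top-k ranked ids."""
    if not gold_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(gold_ids)) / len(gold_ids)


# Toy query with two gold answers, one of which is ranked third.
ranked = ["e3", "e7", "e1", "e9"]
gold = {"e1", "e4"}
print(hit_at_k(ranked, gold, 1))     # 0.0
print(hit_at_k(ranked, gold, 3))     # 1.0
print(recall_at_k(ranked, gold, 3))  # 0.5
```

Hit@1 computed this way corresponds to what the page calls the first-hit rate; benchmark-level numbers are averages of these per-query values.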

Original abstract

In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. Yet, most rely on either. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data. In this work, we present Autofocus-Retriever (AF-Retriever), a modular framework for SKB-based, multi-hop question answering. It combines structural and textual retrieval through novel integration steps and optimizations, achieving the best zero- and one-shot results across all three STaRK QA benchmarks, which span diverse domains and evaluation metrics. AF-Retriever's average first-hit rate surpasses the second-best method by 32.1%. Its performance is driven by (1) leveraging exchangeable large language models (LLMs) to extract entity attributes and relational constraints for both parsing and reranking the top-k answers, (2) vector similarity search for ranking both extracted entities and final answers, (3) a novel incremental scope expansion procedure that prepares for the reranking on a configurable amount of suitable candidates that fulfill the given constraints the most, and (4) a hybrid retrieval strategy that reduces error susceptibility. In summary, while constantly adjusting the focus like an optical autofocus, AF-Retriever delivers a configurable amount of answer candidates in four constraint-driven retrieval steps, which are then supplemented and ranked through four additional processing steps. An ablation study and a detailed error analysis, including a comparison of three different LLM reranking strategies, provide component-level insights. The source code is available at https://github.com/kramerlab/AF-Retriever .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AF-Retriever delivers a practical pipeline with strong benchmark results on semi-structured multi-hop QA, but the gains rest on LLM extraction that lacks separate validation.

This paper gives you a concrete pipeline for multi-hop question answering when your knowledge base mixes structured elements like graphs or tables with plain text documents. The Autofocus-Retriever chains LLM-based extraction of entities and constraints, vector similarity ranking, an incremental scope expansion step, and a hybrid retrieval approach. It reports the strongest zero-shot and one-shot numbers on the three STaRK QA benchmarks, with a 32.1% average improvement in first-hit rate over the runner-up.

The new part is the specific four-step constraint-driven sequence plus the way they prepare candidates for reranking through scope expansion. Individual pieces like LLM parsing or hybrid search have been tried before, but the full integration with configurable expansion depth looks like a fresh engineering choice for this setting. They back it up with an ablation study on the reranking LLMs and a detailed error analysis. The code is public on GitHub, which lets you test it directly. That combination makes the work more useful than papers that just describe an idea without the extras.

The main concern is how much the results depend on the LLM extraction working cleanly. The pipeline starts by having the model pull out attributes and relations from the question. If that step misses key multi-hop connections on any of the benchmark domains, the later expansion and ranking stages run on a weaker candidate set. The paper includes error analysis and compares three reranking strategies, but it does not appear to include an independent accuracy check on the initial parsing against gold labels. That leaves open the possibility that some of the reported lift comes from the LLM performing well on these particular datasets rather than from the retrieval mechanics alone.

This is the kind of paper that will interest people building real QA systems for semi-structured data in domains like enterprise search or scientific literature. Practitioners who need a modular, configurable retriever will get value from the design and the benchmark numbers. It is worth sending to peer review. The evaluation is grounded in public benchmarks, the components are tested, and the code is available, so referees can dig into the details and check the robustness claims.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Autofocus-Retriever (AF-Retriever), a modular pipeline for multi-hop question answering over semi-structured knowledge bases (SKBs). It integrates LLM-based extraction of entity attributes and relational constraints for parsing and reranking, vector similarity search, a novel incremental scope expansion procedure, and a hybrid retrieval strategy. The work reports state-of-the-art zero- and one-shot results on three STaRK QA benchmarks spanning diverse domains, with an average first-hit rate 32.1% higher than the second-best baseline. An ablation study and error analysis (including three LLM reranking variants) are included, and code is released.

Significance. If the performance claims hold after addressing validation gaps, the framework would represent a practical advance in hybrid structured-unstructured retrieval for complex QA. The configurable, constraint-driven design and emphasis on reducing error propagation could influence future SKB-based systems, while the open-source release and component-level analysis add reproducibility value.

major comments (2)

[Section 3 (Parsing and Constraint Extraction)] The central performance claim (32.1% first-hit lift) rests on four constraint-driven retrieval steps that begin with LLM extraction of entity attributes and relational constraints. No independent gold-standard annotation or accuracy metric is reported for this initial parsing step on the STaRK domains, leaving open the possibility that extraction errors systematically limit candidate pools and that gains partly reflect LLM behavior rather than the hybrid design.
[Section 4.2 (Incremental Scope Expansion) and Section 5 (Ablation Study)] The incremental scope expansion and subsequent vector-ranking stages operate directly on the output of the LLM parser. Without a quantitative breakdown of how often critical multi-hop relations are missed in the extraction phase, it is difficult to attribute the reported improvements to the scope-expansion procedure versus favorable parsing on the evaluation sets.

minor comments (2)

[Abstract and Section 1] The abstract and introduction would benefit from a concise diagram or pseudocode outlining the eight processing steps (four retrieval + four post-processing) to clarify the overall flow for readers.
[Section 5 (Error Analysis and Reranking Comparison)] Clarify whether the three LLM reranking strategies were chosen post-hoc or pre-specified, and report statistical significance (e.g., paired t-tests or bootstrap intervals) for the 32.1% average improvement across the three benchmarks.
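
The bootstrap intervals requested in the last comment could be computed along the following lines. This is a generic percentile bootstrap over paired per-query scores, written as a sketch; the function name, the default of 2000 resamples, and the fixed seed are assumptions, and nothing here reproduces the paper's data.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_a) - mean(scores_b).

    Both lists hold per-query scores (for example Hit@1 as 0.0 or 1.0) for
    the same queries in the same order, so queries are resampled in pairs.
    Returns (observed difference, lower bound, upper bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Toy per-query Hit@1 values for two systems (invented for this example).
system_a = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
system_b = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
diff, lower, upper = paired_bootstrap_ci(system_a, system_b)
print(f"difference={diff:.3f}, 95% interval=[{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would suggest the difference is unlikely to be a resampling artifact on that benchmark; running it per benchmark avoids mixing domains with different query counts.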

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the identification of validation gaps in the parsing component and the need for better attribution of improvements. Below, we provide point-by-point responses to the major comments and outline the revisions we will make to address them.

Referee: [Section 3 (Parsing and Constraint Extraction)] The central performance claim (32.1% first-hit lift) rests on four constraint-driven retrieval steps that begin with LLM extraction of entity attributes and relational constraints. No independent gold-standard annotation or accuracy metric is reported for this initial parsing step on the STaRK domains, leaving open the possibility that extraction errors systematically limit candidate pools and that gains partly reflect LLM behavior rather than the hybrid design.

Authors: We agree that reporting an independent accuracy metric for the LLM-based extraction of entity attributes and relational constraints would strengthen the manuscript and help isolate the contribution of the hybrid retrieval design. The current version includes a detailed error analysis that qualitatively examines parsing failures and their impact on retrieval, along with comparisons of LLM reranking variants. However, we did not provide a quantitative gold-standard evaluation of extraction accuracy on the STaRK domains. To address this, we will add a new subsection reporting manual annotation results on a stratified sample of queries from each benchmark, including precision and recall for extracted constraints. This will clarify the reliability of the parsing step.

revision: yes

Referee: [Section 4.2 (Incremental Scope Expansion) and Section 5 (Ablation Study)] The incremental scope expansion and subsequent vector-ranking stages operate directly on the output of the LLM parser. Without a quantitative breakdown of how often critical multi-hop relations are missed in the extraction phase, it is difficult to attribute the reported improvements to the scope-expansion procedure versus favorable parsing on the evaluation sets.

Authors: We acknowledge that a quantitative breakdown of missed critical multi-hop relations during extraction would improve attribution of gains to the incremental scope expansion versus parsing quality. The ablation study in Section 5 already isolates the effect of scope expansion through controlled variants (with and without the procedure), and the error analysis discusses cases of incomplete constraint coverage. Nevertheless, we agree that explicitly quantifying extraction misses in failure cases would be valuable. We will extend the error analysis to include such a breakdown, based on manual review of a sample of unsuccessful queries, reporting the frequency with which key relations were missed in the initial parsing step.

revision: yes
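
The precision and recall figures promised in the rebuttal could be computed per query as in the sketch below. Representing a constraint as a `(subject, relation, object)` tuple and matching by exact equality are assumptions made for illustration; a real annotation study would need to decide how to treat paraphrased or partially correct constraints.

```python
def constraint_precision_recall(predicted, gold):
    """Compare extracted constraints with gold annotations by exact match.

    Precision is the share of extracted constraints that are in the gold set;
    recall is the share of gold constraints that were extracted.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy annotation for one question (invented for this example).
gold = {
    ("?answer", "type", "paper"),
    ("?answer", "written_by", "author_x"),
    ("?answer", "cites", "paper_y"),
}
predicted = {
    ("?answer", "type", "paper"),
    ("?answer", "written_by", "author_x"),
    ("?answer", "published_in", "venue_z"),  # spurious extraction
}
precision, recall = constraint_precision_recall(predicted, gold)
print(round(precision, 3), round(recall, 3))  # 0.667 0.667
```

In this toy case the missed `cites` relation is exactly the kind of multi-hop link whose omission would shrink the candidate pool before scope expansion runs.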

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

The paper presents AF-Retriever as a modular pipeline combining LLM extraction, vector search, incremental scope expansion, and hybrid retrieval, with performance measured directly on the external STaRK QA benchmarks. These benchmarks are independent evaluation standards spanning multiple domains and metrics; the reported first-hit rate gains are not derived from or equivalent to any internal fitted parameters, self-defined quantities, or predictions that reduce to the method's own inputs by construction. No equations appear in the abstract or described components that would create self-definitional loops, and ablations plus error analysis evaluate component contributions against the same external benchmarks rather than assuming success tautologically. The derivation chain is therefore self-contained through procedural description and external validation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The method depends on standard assumptions that LLMs produce usable extractions and that vector similarity plus constraint matching surface relevant candidates; no new physical entities or ad-hoc constants are introduced beyond typical retrieval hyperparameters such as top-k size and scope-expansion depth.

free parameters (2)

top-k candidate size
Configurable number of candidates prepared for reranking; chosen to balance coverage and compute.
scope expansion depth
Number of incremental expansion steps; directly affects how many constraint-satisfying items are collected before LLM reranking.

axioms (2)

domain assumption: LLMs can extract entity attributes and relational constraints from natural language questions with sufficient accuracy for retrieval guidance.
Invoked when the pipeline uses LLMs for both parsing and reranking steps.
domain assumption: Vector similarity search ranks extracted entities and final answers in a way that aligns with human relevance judgments on the benchmarks.
Underlying the ranking components described in the abstract.
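
The second assumption is easier to assess with a concrete picture of what vector similarity ranking does. The sketch below ranks entities by cosine similarity to a query embedding and keeps the top `top_k`. Cosine similarity, the function names, and the two-dimensional toy vectors are assumptions for illustration; this page does not state which similarity measure or embedding model the paper uses.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, entity_vecs, top_k):
    """Return the ids of the top_k entities most similar to the query vector."""
    scored = sorted(
        entity_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [entity_id for entity_id, _ in scored[:top_k]]


# Toy embeddings (invented for this example).
entities = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(rank_by_similarity([1.0, 0.0], entities, top_k=2))  # ['a', 'c']
```

Here `top_k` corresponds to the first free parameter in the ledger: it caps how many candidates are handed to the later reranking stage.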

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: AF-Retriever delivers a configurable amount of answer candidates in four constraint-driven retrieval steps... incremental scope expansion procedure... hybrid retrieval strategy

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: leveraging exchangeable large language models (LLMs) to extract entity attributes and relational constraints

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
