
arXiv: 2505.09246 · v4 · pith:LEX2UQYUnew · submitted 2025-05-14 · 💻 cs.IR · cs.AI · cs.CL

Autofocus Retrieval: An Effective Pipeline for Multi-Hop Question Answering With Semi-Structured Knowledge

Pith reviewed 2026-05-22 15:44 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords multi-hop question answering · semi-structured knowledge bases · retrieval pipeline · large language models · scope expansion · hybrid retrieval · zero-shot QA · constraint extraction

The pith

A new retrieval pipeline parses questions with language models then expands search scope step by step to answer multi-hop queries over semi-structured knowledge more accurately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a system for answering questions that require chaining facts from both structured data, such as tables or knowledge graphs, and the free-text documents linked to them. The approach uses large language models to identify the important entities and conditions in a question, then gradually broadens the pool of possible answers while enforcing those conditions, and finally ranks the candidates with similarity measures. Readers might care because many practical questions involve such mixed sources, and better retrieval means more accurate answers with little or no extra training. The paper reports a higher rate of ranking a correct answer first than existing techniques on three benchmarks.

Core claim

AF-Retriever delivers a configurable number of answer candidates in four constraint-driven retrieval steps, which are then supplemented and ranked in four additional processing steps. Like an optical autofocus, it continually adjusts its focus using four components: large language models that extract entity attributes and relational constraints for both initial parsing and later reranking, vector similarity search for ranking entities and answers, a novel incremental scope expansion procedure, and a hybrid retrieval strategy that reduces error susceptibility.

What carries the argument

An incremental scope expansion procedure that prepares a configurable number of candidates that best fulfill the given constraints, combined with large language model extraction for parsing and reranking.
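One way to picture incremental scope expansion, as a rough sketch and not the authors' implementation: require every extracted constraint first, then relax the requirement one constraint at a time until enough candidates have been collected. The data model (entities as attribute dictionaries, constraints as attribute/value pairs) and the names `expand_scope` and `min_candidates` are hypothetical.

```python
from typing import Dict, List, Tuple

Entity = Dict[str, str]
Constraint = Tuple[str, str]  # (attribute, required value); hypothetical representation


def count_satisfied(entity: Entity, constraints: List[Constraint]) -> int:
    """Number of constraints the entity fulfils."""
    return sum(1 for attr, value in constraints if entity.get(attr) == value)


def expand_scope(entities: Dict[str, Entity],
                 constraints: List[Constraint],
                 min_candidates: int) -> List[str]:
    """Collect candidate ids, most constrained first, until min_candidates is reached."""
    candidates: List[str] = []
    # Start with the tightest scope (all constraints) and relax step by step.
    for required in range(len(constraints), -1, -1):
        tier = sorted(
            eid for eid, ent in entities.items()
            if count_satisfied(ent, constraints) == required
        )
        candidates.extend(tier)
        if len(candidates) >= min_candidates:
            break
    return candidates


# Toy knowledge base, invented for illustration only.
kb = {
    "p1": {"type": "paper", "venue": "NeurIPS", "year": "2024"},
    "p2": {"type": "paper", "venue": "NeurIPS", "year": "2023"},
    "p3": {"type": "paper", "venue": "AAAI", "year": "2024"},
    "a1": {"type": "author"},
}
query_constraints = [("type", "paper"), ("venue", "NeurIPS"), ("year", "2024")]
result = expand_scope(kb, query_constraints, min_candidates=2)
```

In this toy run only `p1` satisfies all three constraints, so the scope widens once to admit entities that satisfy two of them. A whole tier is added at a time, so the result can exceed `min_candidates`; a later ranking step would trim it.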

If this is right

  • The hybrid strategy makes the overall retrieval less prone to failures in any single component.
  • Strong results persist in zero-shot and one-shot settings without task-specific fine-tuning.
  • Ablation studies show how individual elements such as different reranking approaches contribute to the outcome.
  • The configurable candidate volume lets users balance speed and accuracy for specific needs.
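The first and last points can be illustrated with a small fallback scheme. This is a sketch under stated assumptions, not the pipeline's actual logic: structural hits are kept first, and if there are fewer than `k` of them the list is filled with text-similarity hits. Token-overlap (Jaccard) similarity stands in for embedding-based vector search, and `hybrid_candidates` is a hypothetical name.

```python
from typing import Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def hybrid_candidates(question: str,
                      structural_hits: List[str],
                      documents: Dict[str, str],
                      k: int) -> List[str]:
    """Keep structural hits first, then fill up to k with text-similarity hits."""
    selected = list(dict.fromkeys(structural_hits))[:k]  # de-duplicate, keep order
    remaining = [doc_id for doc_id in documents if doc_id not in selected]
    # Highest similarity first; ties broken by id for determinism.
    remaining.sort(key=lambda doc_id: (-jaccard(question, documents[doc_id]), doc_id))
    return selected + remaining[: k - len(selected)]


# Toy documents, invented for illustration only.
docs = {
    "d1": "graph retrieval for question answering",
    "d2": "image classification with convolutional networks",
    "d3": "multi-hop question answering over knowledge graphs",
}
question = "multi-hop question answering"
with_structure = hybrid_candidates(question, ["d1"], docs, k=2)
without_structure = hybrid_candidates(question, [], docs, k=2)
```

If the structural step returns nothing, for example because constraint parsing failed, the text-similarity path still produces `k` candidates, which is the sense in which a hybrid design is less exposed to a single failing component.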

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same step-by-step expansion logic could apply to retrieval tasks that do not require multiple hops.
  • Trying the pipeline on knowledge bases from additional domains would test how well the constraint extraction generalizes.
  • Improvements in language model accuracy for constraint parsing would likely raise the upper limit on final answer quality.

Load-bearing premise

Large language models can extract entity attributes and relational constraints from questions reliably enough that errors do not degrade the incremental expansion and final ranking steps.

What would settle it

A direct measurement of average first-hit rates on the STaRK QA benchmarks would settle it: if the new pipeline did not exceed the second-best method by a substantial margin, the reported performance gains would not hold.
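Such a measurement could be sketched as follows. The metric definitions are the standard ones for Hit@k and recall@k; the rankings are made-up toy data, not STaRK results. This page does not say whether the 32.1% margin is a relative gain or a difference in percentage points, so the sketch reports both.

```python
from typing import List, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if any of the top-k ranked items is a gold answer, else 0.0."""
    return 1.0 if any(item in gold for item in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold answers that appear among the top-k ranked items."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def mean_hit_at_1(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Average first-hit rate over (ranking, gold answers) pairs."""
    return sum(hit_at_k(ranked, gold, 1) for ranked, gold in runs) / len(runs)


def margins(ours: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute difference, relative gain) of ours over baseline."""
    return ours - baseline, (ours - baseline) / baseline


# Made-up toy rankings for four queries; not real benchmark output.
gold_sets = [{"a"}, {"b"}, {"c", "d"}, {"e"}]
ours_runs = list(zip([["a", "x"], ["b", "y"], ["x", "c"], ["e", "z"]], gold_sets))
base_runs = list(zip([["a", "x"], ["y", "b"], ["x", "y"], ["e", "z"]], gold_sets))

ours_hit1 = mean_hit_at_1(ours_runs)
base_hit1 = mean_hit_at_1(base_runs)
absolute, relative = margins(ours_hit1, base_hit1)
```

On this toy data the two readings differ by a factor of two (0.25 absolute versus 0.5 relative), which is why the interpretation of a reported margin matters.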

Figures

Figures reproduced from arXiv: 2505.09246 by Derian Boer, Stefan Kramer, Stephen Roth.

Figure 1: Overview of Autofocus-Retriever’s processing from the user query …
Figure 2: Hit@20 and recall@20 of AF-Retriever depending on …
Figure 3: Hit@20 and recall@20 of AF-Retriever depending on …
Original abstract

In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. Yet, most rely on either. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data. In this work, we present Autofocus-Retriever (AF-Retriever), a modular framework for SKB-based, multi-hop question answering. It combines structural and textual retrieval through novel integration steps and optimizations, achieving the best zero- and one-shot results across all three STaRK QA benchmarks, which span diverse domains and evaluation metrics. AF-Retriever's average first-hit rate surpasses the second-best method by 32.1%. Its performance is driven by (1) leveraging exchangeable large language models (LLMs) to extract entity attributes and relational constraints for both parsing and reranking the top-k answers, (2) vector similarity search for ranking both extracted entities and final answers, (3) a novel incremental scope expansion procedure that prepares for the reranking on a configurable amount of suitable candidates that fulfill the given constraints the most, and (4) a hybrid retrieval strategy that reduces error susceptibility. In summary, while constantly adjusting the focus like an optical autofocus, AF-Retriever delivers a configurable amount of answer candidates in four constraint-driven retrieval steps, which are then supplemented and ranked through four additional processing steps. An ablation study and a detailed error analysis, including a comparison of three different LLM reranking strategies, provide component-level insights. The source code is available at https://github.com/kramerlab/AF-Retriever .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Autofocus-Retriever (AF-Retriever), a modular pipeline for multi-hop question answering over semi-structured knowledge bases (SKBs). It integrates LLM-based extraction of entity attributes and relational constraints for parsing and reranking, vector similarity search, a novel incremental scope expansion procedure, and a hybrid retrieval strategy. The work reports state-of-the-art zero- and one-shot results on three STaRK QA benchmarks spanning diverse domains, with an average first-hit rate 32.1% higher than the second-best baseline. An ablation study and error analysis (including three LLM reranking variants) are included, and code is released.

Significance. If the performance claims hold after addressing validation gaps, the framework would represent a practical advance in hybrid structured-unstructured retrieval for complex QA. The configurable, constraint-driven design and emphasis on reducing error propagation could influence future SKB-based systems, while the open-source release and component-level analysis add reproducibility value.

major comments (2)
  1. [Section 3 (Parsing and Constraint Extraction)] The central performance claim (32.1% first-hit lift) rests on four constraint-driven retrieval steps that begin with LLM extraction of entity attributes and relational constraints. No independent gold-standard annotation or accuracy metric is reported for this initial parsing step on the STaRK domains, leaving open the possibility that extraction errors systematically limit candidate pools and that gains partly reflect LLM behavior rather than the hybrid design.
  2. [Section 4.2 (Incremental Scope Expansion) and Section 5 (Ablation Study)] The incremental scope expansion and subsequent vector-ranking stages operate directly on the output of the LLM parser. Without a quantitative breakdown of how often critical multi-hop relations are missed in the extraction phase, it is difficult to attribute the reported improvements to the scope-expansion procedure versus favorable parsing on the evaluation sets.
minor comments (2)
  1. [Abstract and Section 1] The abstract and introduction would benefit from a concise diagram or pseudocode outlining the eight processing steps (four retrieval + four post-processing) to clarify the overall flow for readers.
  2. [Section 5 (Error Analysis and Reranking Comparison)] Clarify whether the three LLM reranking strategies were chosen post-hoc or pre-specified, and report statistical significance (e.g., paired t-tests or bootstrap intervals) for the 32.1% average improvement across the three benchmarks.
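The bootstrap interval mentioned in the last minor comment could be computed along these lines. This is a generic paired percentile bootstrap over per-query scores, not something taken from the paper; the function name, the resample count, and the seed are illustrative choices.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(ours: List[float],
                        baseline: List[float],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean per-query score difference."""
    if len(ours) != len(baseline) or not ours:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(ours)
    # Pairing: each query contributes one difference, resampled as a unit.
    diffs = [o - b for o, b in zip(ours, baseline)]
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-query Hit@1 scores for two systems on the same ten queries.
ours_scores = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0]
base_scores = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
low, high = paired_bootstrap_ci(ours_scores, base_scores)
```

An interval that excludes zero would suggest the improvement is unlikely to come from query sampling alone. Resampling whole queries keeps the pairing between the two systems intact.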

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the identification of validation gaps in the parsing component and the need for better attribution of improvements. Below, we provide point-by-point responses to the major comments and outline the revisions we will make to address them.

point-by-point responses
  1. Referee: [Section 3 (Parsing and Constraint Extraction)] The central performance claim (32.1% first-hit lift) rests on four constraint-driven retrieval steps that begin with LLM extraction of entity attributes and relational constraints. No independent gold-standard annotation or accuracy metric is reported for this initial parsing step on the STaRK domains, leaving open the possibility that extraction errors systematically limit candidate pools and that gains partly reflect LLM behavior rather than the hybrid design.

    Authors: We agree that reporting an independent accuracy metric for the LLM-based extraction of entity attributes and relational constraints would strengthen the manuscript and help isolate the contribution of the hybrid retrieval design. The current version includes a detailed error analysis that qualitatively examines parsing failures and their impact on retrieval, along with comparisons of LLM reranking variants. However, we did not provide a quantitative gold-standard evaluation of extraction accuracy on the STaRK domains. To address this, we will add a new subsection reporting manual annotation results on a stratified sample of queries from each benchmark, including precision and recall for extracted constraints. This will clarify the reliability of the parsing step. revision: yes

  2. Referee: [Section 4.2 (Incremental Scope Expansion) and Section 5 (Ablation Study)] The incremental scope expansion and subsequent vector-ranking stages operate directly on the output of the LLM parser. Without a quantitative breakdown of how often critical multi-hop relations are missed in the extraction phase, it is difficult to attribute the reported improvements to the scope-expansion procedure versus favorable parsing on the evaluation sets.

    Authors: We acknowledge that a quantitative breakdown of missed critical multi-hop relations during extraction would improve attribution of gains to the incremental scope expansion versus parsing quality. The ablation study in Section 5 already isolates the effect of scope expansion through controlled variants (with and without the procedure), and the error analysis discusses cases of incomplete constraint coverage. Nevertheless, we agree that explicitly quantifying extraction misses in failure cases would be valuable. We will extend the error analysis to include such a breakdown, based on manual review of a sample of unsuccessful queries, reporting the frequency with which key relations were missed in the initial parsing step. revision: yes
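The precision and recall figures promised in the first response could be computed as in the sketch below, assuming extracted constraints and manual annotations are both stored as sets of triples. That representation and the example triples are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject type, relation or attribute, value); hypothetical


def precision_recall(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted constraints against gold ones."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented annotation for a single query, for illustration only.
gold_constraints = {
    ("paper", "published_in", "NeurIPS"),
    ("paper", "year", "2024"),
    ("paper", "cites", "paper"),
    ("paper", "topic", "retrieval"),
}
llm_constraints = {
    ("paper", "published_in", "NeurIPS"),
    ("paper", "author", "unknown"),
}
precision, recall = precision_recall(llm_constraints, gold_constraints)
```

Here the low recall would correspond to the missed relations that the second response proposes to count. Exact matching is the strictest choice; a real annotation study might also accept near matches.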

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks


The paper presents AF-Retriever as a modular pipeline combining LLM extraction, vector search, incremental scope expansion, and hybrid retrieval, with performance measured directly on the external STaRK QA benchmarks. These benchmarks are independent evaluation standards spanning multiple domains and metrics; the reported first-hit rate gains are not derived from or equivalent to any internal fitted parameters, self-defined quantities, or predictions that reduce to the method's own inputs by construction. No equations appear in the abstract or described components that would create self-definitional loops, and ablations plus error analysis evaluate component contributions against the same external benchmarks rather than assuming success tautologically. The derivation chain is therefore self-contained through procedural description and external validation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The method depends on standard assumptions that LLMs produce usable extractions and that vector similarity plus constraint matching surface relevant candidates; no new physical entities or ad-hoc constants are introduced beyond typical retrieval hyperparameters such as top-k size and scope-expansion depth.

free parameters (2)
  • top-k candidate size
    Configurable number of candidates prepared for reranking; chosen to balance coverage and compute.
  • scope expansion depth
    Number of incremental expansion steps; directly affects how many constraint-satisfying items are collected before LLM reranking.
axioms (2)
  • domain assumption: LLMs can extract entity attributes and relational constraints from natural language questions with sufficient accuracy for retrieval guidance.
    Invoked when the pipeline uses LLMs for both parsing and reranking steps.
  • domain assumption: Vector similarity search ranks extracted entities and final answers in a way that aligns with human relevance judgments on the benchmarks.
    Underlying the ranking components described in the abstract.
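To make the top-k parameter and the vector-similarity assumption concrete, here is a minimal cosine-similarity ranking sketch. The two-dimensional vectors are hand-made placeholders for real embeddings, and `top_k` is a hypothetical helper, not part of AF-Retriever.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          candidate_vecs: Dict[str, Sequence[float]],
          k: int) -> List[str]:
    """Ids of the k candidates most similar to the query, best first."""
    ranked = sorted(
        candidate_vecs,
        key=lambda cid: (-cosine(query_vec, candidate_vecs[cid]), cid),
    )
    return ranked[:k]


# Placeholder vectors, invented for illustration only.
candidates = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
best_two = top_k([1.0, 0.0], candidates, k=2)
```

A larger `k` passes more candidates to the reranker, raising coverage at the price of more compute, which is the trade-off the ledger describes for this parameter.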

pith-pipeline@v0.9.0 · 5843 in / 1523 out tokens · 28508 ms · 2026-05-22T15:44:08.560007+00:00 · methodology

