arxiv: 2606.26458 · v1 · pith:HZZRIASHnew · submitted 2026-06-24 · 💻 cs.AI

MKG-RAG-Bench: Benchmarking Retrieval in Multimodal Knowledge Graph-Augmented Generation

Pith reviewed 2026-06-26 01:09 UTC · model grok-4.3

classification 💻 cs.AI
keywords: multimodal knowledge graph · retrieval augmented generation · benchmark · multimodal retrieval · knowledge graph RAG · question answering · large language models

The pith

Retrieval quality from multimodal knowledge graphs strongly determines the accuracy of answers generated by large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds MKG-RAG-Bench from two multimodal knowledge graphs covering general and medical domains, paired with aligned question-answering sets that let researchers measure retrieval and generation separately. An LLM curation step removes low-value knowledge and creates queries that carry exact supervision across text, image, and graph modalities. Experiments run on multiple retriever families show that multimodal retrieval stays difficult and that higher retrieval scores produce measurably better final answers. The benchmark therefore treats retrieval itself as the primary evaluation target rather than burying it inside end-to-end scores.

Core claim

Experiments on MKG-RAG-Bench establish that effective multimodal retrieval remains challenging yet crucial for end-to-end MKG-RAG performance, and that retrieval quality strongly determines generation outcomes.

What carries the argument

MKG-RAG-Bench, a cross-domain benchmark built from multimodal knowledge graphs that supplies aligned QA datasets and supports isolated measurement of retrieval quality before generation.
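Measuring retrieval in isolation amounts to scoring a ranked list against the exact supervision attached to each query, before any generator runs. The sketch below shows two common ranking metrics, Recall@k and reciprocal rank. The choice of metrics, the function names, and the item ids are assumptions made for illustration; this page does not state which retrieval metrics the paper reports.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Fraction of gold items that appear among the top-k retrieved ids."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & gold_ids) / len(gold_ids)


def reciprocal_rank(ranked_ids: Sequence[str], gold_ids: Set[str]) -> float:
    """1 / rank of the first gold item, or 0.0 if no gold item was retrieved."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in gold_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical ids mixing image, triple, and text items; not benchmark data.
ranked = ["img_7", "triple_3", "txt_9", "triple_1"]
gold = {"triple_3", "triple_1"}

print(recall_at_k(ranked, gold, k=2))  # 0.5
print(reciprocal_rank(ranked, gold))   # 0.5
```

Because both functions only need ids, the same scoring applies whether the retrieved item is a text passage, an image, or a graph triple.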

If this is right

  • Higher-performing multimodal retrievers will raise the accuracy of answers produced by MKG-RAG systems.
  • Retrievers tuned only on unstructured text will continue to underperform when knowledge is distributed across modalities inside a graph.
  • Generation quality tracks retrieval quality across both general and medical domains.
  • Developers can now diagnose whether failures originate in retrieval or in the generator by using the benchmark's controlled splits.
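The last point, separating retrieval failures from generator failures, can be sketched as a per-example attribution rule: if no gold item reaches the top-k, blame retrieval; if the evidence was retrieved but the answer is still wrong, blame the generator. The `Example` fields, the category labels, and the sample records are hypothetical and are not the benchmark's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Example:
    gold_ids: Set[str]        # exact retrieval supervision for the query
    retrieved_ids: List[str]  # retriever output, best first
    answer_correct: bool      # verdict from whatever answer metric is in use


def attribute(example: Example, k: int) -> str:
    """Label one example by where its outcome most plausibly originates."""
    hit = any(i in example.gold_ids for i in example.retrieved_ids[:k])
    if example.answer_correct:
        return "ok" if hit else "correct_without_evidence"
    return "generation_failure" if hit else "retrieval_failure"


# Hypothetical records for illustration only.
examples = [
    Example({"t1"}, ["t1", "t2"], True),
    Example({"t1"}, ["t2", "t3"], False),
    Example({"t1"}, ["t1", "t3"], False),
]

print(Counter(attribute(e, k=2) for e in examples))
```

The `correct_without_evidence` bucket is worth tracking separately, since it flags answers the generator may have produced from parametric knowledge rather than from the graph.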

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Medical MKG-RAG applications may require extra safeguards on image-text alignment that the current benchmark only begins to expose.
  • The same retrieval-first evaluation approach could be applied to other structured knowledge sources beyond the two graphs tested here.
  • System builders should allocate more engineering effort to retrieval modules when sources mix text, images, and relations.

Load-bearing premise

The LLM-based curation pipeline reliably filters low-utility knowledge and generates structurally grounded queries with exact supervision without introducing systematic biases or alignment errors across modalities.

What would settle it

Running the benchmark's test sets through several retrievers and finding no consistent correlation between standard retrieval metrics and downstream answer accuracy would falsify the central claim.
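One way to run that check is a rank correlation between a retrieval metric and answer accuracy, computed across retrievers. The sketch below implements Spearman's coefficient in plain Python. The choice of Spearman is an assumption, and the numbers are placeholders invented for the example, not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores, one entry per hypothetical retriever.
retrieval_score = [0.2, 0.4, 0.6, 0.8]
answer_accuracy = [0.1, 0.3, 0.2, 0.5]

print(round(spearman(retrieval_score, answer_accuracy), 3))  # 0.8
```

A coefficient near zero that persists across retrievers and both domains would be the kind of evidence that undercuts the central claim. With only a handful of retrievers the estimate is noisy, so it should be read together with a significance test or a per-query analysis.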

Figures

Figures reproduced from arXiv: 2606.26458 by Bao Hoang, Fenglong Ma, Han Liu, Ting Wang, Xiaochen Wang.

Figure 1: The proposed pipeline for benchmark construction using multimodal knowledge graphs and LLMs.
Figure 2: BLEU (%) comparison on (a) VQA-RAD and (b)
Figure 3: Two-agent prompting for query construction: filtering low-quality triplets and generating WH-questions for the
Original abstract

Retrieval-augmented generation (RAG) over knowledge graphs has emerged as a promising approach for grounding large language models, yet existing benchmarks largely overlook the challenges of retrieval in multimodal knowledge graph RAG (MKG-RAG). In practice, retrieval is a critical bottleneck: multimodal knowledge is heterogeneous, difficult to align across modalities, and often poorly served by retrievers designed for unstructured corpora. To address this gap, we introduce MKG-RAG-Bench, a cross-domain benchmark explicitly designed to evaluate retrieval in MKG-RAG. MKG-RAG-Bench is constructed from two multimodal knowledge graphs spanning general and medical domains, and includes carefully aligned question-answering datasets that support controlled evaluation of both retrieval and downstream generation. The benchmark is built using an LLM-based curation pipeline that filters low-utility knowledge, generates structurally grounded queries with exact supervision, and systematically covers diverse modality configurations. Through extensive experiments across representative retriever families and modality settings, we show that effective multimodal retrieval remains challenging yet crucial for end-to-end MKG-RAG performance, and that retrieval quality strongly determines generation outcomes. By isolating retrieval as a first-class evaluation target, MKG-RAG-Bench provides a principled foundation for diagnosing current limitations and advancing multimodal knowledge graph RAG systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MKG-RAG-Bench, a cross-domain benchmark for evaluating retrieval in multimodal knowledge graph-augmented generation (MKG-RAG). It is constructed from two multimodal knowledge graphs (general and medical domains) via an LLM-based curation pipeline that filters low-utility triples and generates structurally grounded queries with exact supervision. The benchmark includes aligned QA datasets supporting controlled retrieval and generation evaluation. Experiments across retriever families and modality settings are used to claim that effective multimodal retrieval remains challenging yet crucial for end-to-end MKG-RAG performance and that retrieval quality strongly determines generation outcomes.

Significance. If the curation pipeline proves reliable, the benchmark would address a clear gap by isolating retrieval as a first-class target in MKG-RAG, where modality heterogeneity and alignment pose documented difficulties. The empirical scope across representative retrievers and domains provides a concrete starting point for diagnosing limitations, which is valuable for a benchmark paper.

major comments (2)
  1. [Abstract / Benchmark Construction] The central claim that retrieval quality strongly determines generation outcomes depends on MKG-RAG-Bench being a faithful testbed. The LLM-based curation pipeline is described as filtering low-utility knowledge and generating structurally grounded queries, yet the manuscript provides no quantitative human validation, inter-annotator agreement, or error analysis of modality alignment or query grounding. This is load-bearing for interpreting all reported performance gaps and correlations.
  2. [Experiments] The abstract asserts that experiments demonstrate retrieval quality strongly determines outcomes, but without reported quantitative metrics, ablation on the curation pipeline, or analysis of potential LLM-induced biases in the testbed, it is not possible to verify whether the observed gaps reflect intrinsic retrieval hardness or artifacts of the construction process.
minor comments (1)
  1. [Abstract] Consider adding one or two key quantitative results (e.g., retrieval metrics or correlation coefficients) to give readers an immediate sense of effect sizes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which underscores the need for rigorous validation in benchmark construction. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

point-by-point responses
  1. Referee: [Abstract / Benchmark Construction] The central claim that retrieval quality strongly determines generation outcomes depends on MKG-RAG-Bench being a faithful testbed. The LLM-based curation pipeline is described as filtering low-utility knowledge and generating structurally grounded queries, yet the manuscript provides no quantitative human validation, inter-annotator agreement, or error analysis of modality alignment or query grounding. This is load-bearing for interpreting all reported performance gaps and correlations.

    Authors: We agree that the absence of quantitative human validation limits the strength of claims about the benchmark's fidelity. In the revision we will add a human evaluation on a sampled subset of curated triples and queries, reporting inter-annotator agreement and a detailed error analysis of modality alignment and query grounding. These results will be presented alongside the existing pipeline description.
    revision: yes

  2. Referee: [Experiments] The abstract asserts that experiments demonstrate retrieval quality strongly determines outcomes, but without reported quantitative metrics, ablation on the curation pipeline, or analysis of potential LLM-induced biases in the testbed, it is not possible to verify whether the observed gaps reflect intrinsic retrieval hardness or artifacts of the construction process.

    Authors: We concur that additional quantitative support is required to substantiate the reported correlations. The revised manuscript will include ablations isolating components of the curation pipeline together with an analysis of LLM-induced biases, accompanied by the corresponding quantitative metrics. These additions will help distinguish intrinsic retrieval challenges from construction artifacts.
    revision: yes
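The inter-annotator agreement promised in the first response could be reported with a chance-corrected statistic. The sketch below computes Cohen's kappa for two annotators who each label curated triples as keep or drop. The rebuttal does not name a statistic, so the choice of kappa, the label names, and the sample annotations are all assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if both annotators labeled independently at their own rates.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical keep/drop judgments on four curated triples.
annotator_1 = ["keep", "keep", "drop", "drop"]
annotator_2 = ["keep", "drop", "drop", "drop"]

print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Raw percent agreement here is 0.75, but half of that is expected by chance given the label rates, which is why kappa lands at 0.5.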

Circularity Check

0 steps flagged

No circularity; benchmark construction and empirical evaluation are self-contained

full rationale

The paper constructs MKG-RAG-Bench from existing multimodal KGs using an LLM curation pipeline, then runs controlled experiments on retriever families. No equations, fitted parameters, or predictions are present. Central claims rest on direct empirical measurements of retrieval and generation performance rather than any self-referential derivation or self-citation chain. The LLM pipeline is an input construction step, not a load-bearing derivation that reduces to its own outputs by definition. This matches the default case of a non-circular empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that the LLM curation process produces faithful, unbiased QA pairs; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: LLM-based curation produces reliable, unbiased queries with exact supervision from the knowledge graph.
    Invoked in the description of the benchmark construction pipeline.

pith-pipeline@v0.9.1-grok · 5765 in / 1151 out tokens · 29230 ms · 2026-06-26T01:09:48.649338+00:00 · methodology

