pith. machine review for the scientific record.

arxiv: 2604.05818 · v2 · submitted 2026-04-07 · 💻 cs.CV · cs.CL · cs.IR

Recognition: 2 theorem links · Lean Theorem

WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:53 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.IR
keywords knowledge-based visual question answering · multi-modal RAG · vision-language models · query refinement · retrieval inspection · EVQA · InfoSeek · M2KR

The pith

WikiSeeker reassigns vision-language models to refine queries using images and inspect retrieval reliability for better knowledge-based visual QA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing multi-modal RAG methods for knowledge-based visual question answering underuse VLMs by treating them mainly as answer generators and relying on images alone for retrieval. WikiSeeker instead turns the VLM into two agents: a Refiner that rewrites the textual query conditioned on the input image to strengthen the multimodal retriever, and an Inspector that checks whether retrieved context is reliable enough to pass to a separate LLM or whether the VLM should fall back to its own internal knowledge. Experiments on EVQA, InfoSeek, and M2KR show gains in both retrieval accuracy and final answer quality. A sympathetic reader would care because the approach offers a concrete way to make visual QA systems more robust when the question text and image must be tightly coordinated.
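
Read as a system, the claim is a short piece of control flow. The sketch below is a minimal rendering of that flow, not the paper's implementation; every callable name (refine_query, retrieve, inspect, llm_generate, vlm_generate) is a hypothetical stand-in for the components the paper describes.

    from typing import Callable, List

    def wikiseeker_answer(
        image: bytes,
        question: str,
        refine_query: Callable[[bytes, str], str],         # VLM Refiner: image-conditioned rewrite
        retrieve: Callable[[bytes, str, int], List[str]],  # multi-modal retriever over the knowledge base
        inspect: Callable[[bytes, str, str], bool],        # VLM Inspector: is the retrieved context reliable?
        llm_generate: Callable[[str, str], str],           # separate LLM that answers from retrieved context
        vlm_generate: Callable[[bytes, str], str],         # VLM fallback on its internal knowledge
        top_k: int = 5,
    ) -> str:
        # 1. The Refiner rewrites the textual query conditioned on the image.
        refined = refine_query(image, question)
        # 2. The multi-modal retriever fetches candidate passages for the refined query.
        passages = retrieve(image, refined, top_k)
        context = "\n\n".join(passages)
        # 3. The Inspector judges whether the retrieved context is reliable.
        if passages and inspect(image, question, context):
            # 4a. Reliable context is routed to the external LLM for answer generation.
            return llm_generate(question, context)
        # 4b. Otherwise fall back to the VLM's internal knowledge.
        return vlm_generate(image, question)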

Core claim

WikiSeeker bridges gaps in multi-modal RAG for KB-VQA by proposing a multi-modal retriever and redefining VLMs as specialized agents rather than mere answer generators: the Refiner rewrites the textual query according to the input image to improve retrieval, while the Inspector enables a decoupled generation strategy by routing reliable retrieved context to another LLM and falling back to the VLM's internal knowledge when retrieval is unreliable, yielding state-of-the-art results on EVQA, InfoSeek, and M2KR.

What carries the argument

Two VLM-powered agents—the Refiner that performs image-conditioned textual query rewriting to boost multimodal retrieval, and the Inspector that evaluates retrieved context reliability to decide between external LLM generation and internal VLM knowledge.
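
The Refiner half of that pair is just an image-conditioned rewrite call. A minimal sketch, assuming a generic chat-style VLM client named vlm_chat and a chat-message schema of our own choosing; the prompt wording below is illustrative and is not the paper's actual Refiner prompt (which appears only in its figures).

    from typing import Callable, Dict, List

    def make_refiner(vlm_chat: Callable[[List[Dict]], str]) -> Callable[[bytes, str], str]:
        """Wrap a chat-style VLM endpoint as a refine_query(image, question) callable."""
        def refine_query(image: bytes, question: str) -> str:
            messages = [
                {"role": "system",
                 # Illustrative wording only, not the paper's prompt.
                 "content": ("Rewrite the user's question so that it explicitly names the "
                             "entity shown in the image and is self-contained for text "
                             "retrieval. Return only the rewritten question.")},
                {"role": "user",
                 "content": [
                     {"type": "image", "image": image},
                     {"type": "text", "text": question},
                 ]},
            ]
            return vlm_chat(messages).strip()
        return refine_query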

If this is right

  • Retrieval accuracy rises when the query is rewritten to align better with visual content.
  • Answer quality improves because generation only uses reliable context routed to a dedicated LLM.
  • The framework produces consistent gains across EVQA, InfoSeek, and M2KR.
  • VLMs can handle both query adaptation and reliability assessment inside RAG pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The Inspector mechanism could reduce hallucinations in other visual tasks by avoiding low-quality retrievals.
  • Applying the same Refiner pattern to captioning or visual reasoning might improve alignment on ambiguous queries.
  • Pairing the Inspector with larger external LLMs would likely widen the gap over pure VLM generation.

Load-bearing premise

The vision-language model can accurately rewrite queries based on the image to improve retrieval and can correctly judge when retrieved context is reliable enough to use versus falling back to its internal knowledge.
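
The second half of that premise, the reliability judgment, can be phrased as a single yes/no call. The sketch below reuses the opening of the Inspector prompt quoted in the Figures section below; that prompt is truncated in the source, so the single-token verdict format and the vlm_chat client are assumptions.

    from typing import Callable, Dict, List

    # Opening of the Inspector system prompt as quoted in the paper's prompt figure;
    # the remainder is truncated in the source.
    INSPECTOR_SYSTEM = (
        "You are an assistant to determine the consistency and completeness of the "
        "provided context in relation to a question and an image <image>. You will "
        "receive a question and a retrieved context."
    )

    def make_inspector(vlm_chat: Callable[[List[Dict]], str]) -> Callable[[bytes, str, str], bool]:
        def inspect(image: bytes, question: str, context: str) -> bool:
            messages = [
                {"role": "system", "content": INSPECTOR_SYSTEM},
                {"role": "user",
                 "content": [
                     {"type": "image", "image": image},
                     # Assumed output format: a single-token verdict keeps routing easy to parse.
                     {"type": "text",
                      "text": f"Question: {question}\nContext: {context}\n"
                              "Reply with 'reliable' or 'unreliable' only."},
                 ]},
            ]
            return vlm_chat(messages).strip().lower().startswith("reliable")
        return inspect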

What would settle it

Running the same benchmarks with the Refiner and Inspector disabled or replaced by fixed rules and observing no drop in retrieval accuracy or answer quality would show that the specialized VLM roles are not responsible for the reported gains.
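
That test is straightforward to run as an ablation harness once each variant exposes the same answer(image, question) interface. A sketch under assumed variant names and an exact-match answer metric; the paper's own metrics and ablation set may differ.

    from typing import Callable, Dict, List, Tuple

    Example = Tuple[bytes, str, str]  # (image, question, gold_answer)

    def exact_match(pred: str, gold: str) -> float:
        return float(pred.strip().lower() == gold.strip().lower())

    def run_ablation(variants: Dict[str, Callable[[bytes, str], str]],
                     dataset: List[Example]) -> Dict[str, float]:
        """Score the full system against variants with the Refiner or Inspector disabled
        or replaced by fixed rules (e.g. pass the original query through unchanged, or
        always route the retrieved context to the LLM)."""
        scores = {}
        for name, answer in variants.items():
            hits = [exact_match(answer(image, question), gold)
                    for image, question, gold in dataset]
            scores[name] = sum(hits) / max(len(hits), 1)
        return scores

    # Hypothetical usage; the variant names and systems are placeholders, not the paper's code.
    # scores = run_ablation({"full": full_system, "no_refiner": no_refiner,
    #                        "always_route": always_route}, dev_set)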

Figures

Figures reproduced from arXiv: 2604.05818 by Bin Fan, Kun Ding, Shiming Xiang, Xinming Wang, Yingjian Zhu, Ying Wang.

Figure 1: The overall architecture of WikiSeeker in com…
Figure 2: The overall architecture of WikiSeeker. We employ VLMs as specialized agents rather than just generators.
Figure 3: The training pipeline of the VLM Refiner via…
Figure 4: Impact of the multi-modal rerank stage weight.
Figure 5: Impact of the modality weighting hyperparameter.
Figure 6: Impact of the textual rerank stage weight.
Figure 7: Prompt of WikiSeeker Refiner. WikiSeeker Inspector Prompt. System Prompt: Character Introduction. You are an assistant to determine the consistency and completeness of the provided context in relation to a question and an image <image>. You will receive a question and a retrieved context. Follow these steps: 1. Check if the context is consistent with both the image and the question. 2. Determine if the cont…
Figure 8: Prompt of WikiSeeker Inspector.
Figure 9: Prompt of WikiSeeker Generator on EVQA. WikiSeeker Generator Prompt on InfoSeek. System Prompt: You are a helpful assistant for answering encyclopedic questions. Do not answer anything else. If you need to answer questions about numbers or time, please output the corresponding numerical format directly. If the context does not contain the information required to answer the question, you should answer the que…
Figure 10: Prompt of WikiSeeker Generator on InfoSeek.
Figure 11: Prompt for GT answer expansion. Prompt designed for summarizing extended Wikipedia sections. System Prompt: Summarize the following Wikipedia section concisely while preserving key information. User Prompt: Article: {title} Section: {section title} Content: {section text} Provide a concise summary.
Figure 12: Prompt designed for summarizing extended Wikipedia sections.
Figure 13: Prompt for image caption and zero-shot query expansion.
Figure 14: Qualitative examples demonstrating the effectiveness of the Refiner. For each case, the original user…
Figure 15: Visualization of the decoupled generation strategy enabled by the Inspector.
Original abstract

Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that bridges these gaps by proposing a multi-modal retriever and redefining the role of VLMs. Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM's internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released on https://github.com/zhuyjan/WikiSeeker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces WikiSeeker, a multi-modal Retrieval-Augmented Generation (RAG) framework for Knowledge-Based Visual Question Answering (KB-VQA). It proposes a multi-modal retriever and redefines the roles of Vision-Language Models (VLMs) as a 'Refiner' that rewrites textual queries conditioned on the input image to enhance retrieval, and an 'Inspector' that selectively routes reliable retrieved contexts to an external LLM for generation while falling back to the VLM's internal knowledge when retrieval is unreliable. The paper reports state-of-the-art results on the EVQA, InfoSeek, and M2KR benchmarks, claiming substantial improvements in both retrieval accuracy and answer quality.

Significance. If the empirical claims hold and the performance gains can be attributed to the proposed VLM role redefinitions, this work would represent a meaningful advance in KB-VQA by more fully leveraging VLMs within RAG systems. The decoupled generation strategy could address common issues with unreliable retrieval in multimodal settings. The promised code release would further enhance its impact by enabling reproducibility.

major comments (3)
  1. [Experiments] Experiments section: The central claims of SOTA performance and attribution to the Refiner and Inspector roles require isolated validation. Without ablations showing the impact of query rewriting on retrieval metrics (e.g., recall@K improvements) and Inspector precision/recall for routing decisions, it is unclear whether the reported gains stem from these components or from the underlying multi-modal retriever and other implementation choices.
  2. [Method] Method section (Refiner/Inspector description): The Inspector's reliability assessment lacks detail on how 'reliable' is determined (e.g., specific thresholds, features used, or training of the decision process). This makes it difficult to assess the robustness of the decoupled strategy and whether the routing actually improves answer quality over always using the VLM or always using the LLM.
  3. [Results] Results tables: While the abstract asserts SOTA results with substantial improvements, the manuscript must include explicit quantitative metrics, baseline comparisons, and error analysis on all three datasets (EVQA, InfoSeek, M2KR) to substantiate the claims; the current summary provides no numbers, undermining assessment of effect sizes.
minor comments (2)
  1. [Introduction] Introduction: Clarify how the new VLM roles differ from standard VLM usage in prior multimodal RAG works to better highlight novelty.
  2. [Related Work] Related Work: Ensure comprehensive citation of recent multimodal RAG and VLM-based KB-VQA methods for context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We believe the suggested revisions will strengthen the paper, and we address each major comment below, outlining the changes we will make in the revised version.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claims of SOTA performance and attribution to the Refiner and Inspector roles require isolated validation. Without ablations showing the impact of query rewriting on retrieval metrics (e.g., recall@K improvements) and Inspector precision/recall for routing decisions, it is unclear whether the reported gains stem from these components or from the underlying multi-modal retriever and other implementation choices.

    Authors: We agree that ablations are essential to validate the contributions of the Refiner and Inspector. In the revised version, we will add detailed ablation studies in the Experiments section. These will include the impact of the Refiner's query rewriting on retrieval metrics such as recall@K, and the Inspector's precision and recall for routing decisions. We will also compare the full WikiSeeker against variants without each component to attribute the gains specifically to the proposed VLM roles rather than other factors (a sketch of these metrics appears after the responses). revision: yes

  2. Referee: [Method] Method section (Refiner/Inspector description): The Inspector's reliability assessment lacks detail on how 'reliable' is determined (e.g., specific thresholds, features used, or training of the decision process). This makes it difficult to assess the robustness of the decoupled strategy and whether the routing actually improves answer quality over always using the VLM or always using the LLM.

    Authors: We acknowledge the need for more detail on the Inspector. The revised manuscript will expand the description of the Inspector's reliability assessment, specifying the thresholds, features used, and the decision process (including any training). We will also include additional experiments showing the performance of the decoupled strategy compared to always using the VLM or always using the external LLM, demonstrating the benefits of selective routing. revision: yes

  3. Referee: [Results] Results tables: While the abstract asserts SOTA results with substantial improvements, the manuscript must include explicit quantitative metrics, baseline comparisons, and error analysis on all three datasets (EVQA, InfoSeek, M2KR) to substantiate the claims; current summary provides no numbers, undermining assessment of effect sizes.

    Authors: The manuscript includes results tables with metrics and comparisons on EVQA, InfoSeek, and M2KR. To address this, we will revise the abstract to include key quantitative results, ensure explicit reporting of effect sizes in the tables, and add an error analysis section with discussion of failure cases on all datasets. This will provide a clearer substantiation of the SOTA performance and improvements. revision: partial
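
The retrieval and routing metrics promised in responses 1 and 2 reduce to simple bookkeeping over logged runs. A sketch, assuming each example logs the ranked document IDs, the gold document ID, the Inspector's verdict, and whether the retrieved context was in fact sufficient; all names here are hypothetical rather than taken from the paper.

    from typing import List, Tuple

    def recall_at_k(ranked_ids: List[List[str]], gold_ids: List[str], k: int) -> float:
        """Fraction of queries whose gold document appears in the top-k retrieved list."""
        hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
        return hits / max(len(gold_ids), 1)

    def routing_precision_recall(decisions: List[Tuple[bool, bool]]) -> Tuple[float, float]:
        """decisions: (inspector_said_reliable, context_was_actually_sufficient) per example."""
        tp = sum(1 for said, truth in decisions if said and truth)
        fp = sum(1 for said, truth in decisions if said and not truth)
        fn = sum(1 for said, truth in decisions if not said and truth)
        precision = tp / max(tp + fp, 1)
        recall = tp / max(tp + fn, 1)
        return precision, recall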

Circularity Check

0 steps flagged

No significant circularity; framework is empirically validated on external benchmarks

full rationale

The paper presents WikiSeeker as an empirical multi-modal RAG framework that reassigns VLMs to Refiner (for image-conditioned query rewriting) and Inspector (for reliability-based routing to LLM or internal knowledge) roles. These assignments are motivated directly from stated limitations of prior KB-VQA methods (over-reliance on image-only retrieval keys) rather than derived from any self-referential equations or fitted parameters. Performance is demonstrated via end-to-end results on independent benchmarks EVQA, InfoSeek, and M2KR, with no mathematical derivation chain, no predictions that reduce to inputs by construction, and no load-bearing self-citations or uniqueness theorems invoked. The approach extends standard RAG concepts without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the paper introduces no explicit free parameters, new axioms beyond standard machine-learning assumptions, or invented entities; the framework relies on existing VLM and retrieval components.

pith-pipeline@v0.9.0 · 5537 in / 1077 out tokens · 33343 ms · 2026-05-10T19:53:34.433841+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 16 canonical work pages · 9 internal anchors

  3. [3]

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2425--2433

  4. [4]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, and 45 others. 2025a. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631

  5. [5]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, and 1 others. 2025b. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923

  6. [6]

    Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2024. Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPRW), pages 1818--1826

  7. [7]

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216

  8. [8]

    Lin Chen, Yingjian Zhu, Qi Yang, Xin Niu, Kun Ding, and Shiming Xiang. 2025. Sam-mi: A mask-injected framework for enhancing open-vocabulary semantic segmentation with sam. arXiv preprint arXiv:2511.20027

  9. [9]

    Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. 2023. Can pre-trained vision and language models answer visual information-seeking questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 14948--14968

  10. [10]

    Changin Choi, Wonseok Lee, Jungmin Ko, and Wonjong Rhee. 2025. Multimodal iterative rag for knowledge-intensive visual question answering. arXiv preprint arXiv:2509.00798

  11. [11]

    Federico Cocchi, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2025. Augmenting multimodal llms with self-reflective tokens for knowledge-based visual question answering. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 9199--9209

  12. [12]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The llama 3 herd of models. arXiv e-prints, pages arXiv--2407

  13. [13]

    Hexiang Hu, Yi Luan, Yang Chen, Urvashi Khandelwal, Mandar Joshi, Kenton Lee, Kristina Toutanova, and Ming-Wei Chang. 2023. Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12065--12075

  14. [14]

    Pu Jian, Donglei Yu, and Jiajun Zhang. 2024. Large language models know what is key visual entity: An LLM-assisted multimodal retrieval for VQA. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 10939--10956

  15. [15]

    Pengcheng Jiang, Jiacheng Lin, Lang Cao, Runchu Tian, SeongKu Kang, Zifeng Wang, Jimeng Sun, and Jiawei Han. 2025. Deepretrieval: Hacking real search engines and retrievers with large language models via reinforcement learning. arXiv preprint arXiv:2503.00223

  16. [16]

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535--547

  17. [17]

    Weizhe Lin, Jingbiao Mei, Jinghong Chen, and Bill Byrne. 2024. PreFLMR: Scaling up fine-grained late-interaction multi-modal retrievers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 5294--5316

  18. [18]

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3195--3204

  19. [19]

    Yu Meng, Mengzhou Xia, and Danqi Chen. 2024. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems (NeurIPS), 37:124198--124235

  20. [20]

    Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, and Vittorio Ferrari. 2023. Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3113--3124

  21. [21]

    Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and Pratyush Kumar. 2020. Plotqa: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1527--1536

  22. [22]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 35:27730--27744

  23. [23]

    Jingyuan Qi, Zhiyang Xu, Rulin Shao, Yang Chen, Jin Di, Yu Cheng, Qifan Wang, and Lifu Huang. 2024. Rora-vlm: Robust retrieval-augmented vision language models. arXiv preprint arXiv:2410.08876

  24. [24]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems (NeurIPS), 36:53728--53741

  25. [25]

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision (ECCV), pages 146--162. Springer

  26. [26]

    Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. 2019. Kvqa: Knowledge-aware visual question answering. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 33, pages 8876--8884

  27. [27]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300

  28. [28]

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys), pages 1279--1297

  29. [29]

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. 2023. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389

  30. [30]

    Yang Tian, Fan Liu, Jingyuan Zhang, V. W., Yupeng Hu, and Liqiang Nie. 2025. CoRe-MMRAG: Cross-source knowledge reconciliation for multimodal RAG. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 32967--32982

  31. [31]

    Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. 2017a. Explicit knowledge-based reasoning for visual question answering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), pages 1290--1296

  32. [32]

    Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. 2017b. Fvqa: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(10):2413--2427

  33. [33]

    Xinming Wang, Jian Xu, Aslan H Feng, Yi Chen, Haiyang Guo, Fei Zhu, Yuanqi Shao, Minsi Ren, Hongzhu Yi, Sheng Lian, and 1 others. 2025a. The hitchhiker's guide to autonomous research: A survey of scientific agents. TechRxiv, August 07, 2025. DOI: 10.36227/techrxiv175459840.02185500/V1

  34. [34]

    Xinming Wang, Jian Xu, Bin Yu, Sheng Lian, Hongzhu Yi, Yi Chen, Yingjian Zhu, Boran Wang, Hongming Yang, Han Hu, and 1 others. 2025b. Mr-align: Meta-reasoning informed factuality alignment for large reasoning models. arXiv preprint arXiv:2510.24794

  35. [35]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems (NeurIPS), 35:24824--24837

  36. [36]

    Yibin Yan and Weidi Xie. 2024. EchoSight: Advancing visual-language models with Wiki knowledge. In Findings of the Association for Computational Linguistics: EMNLP 2024 (EMNLP), pages 1538--1551

  37. [37]

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, and 40 others. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671

  38. [38]

    Wei Yang, Jingjing Fu, Rui Wang, Jinyu Wang, Lei Song, and Jiang Bian. 2025. OMGM: Orchestrate multiple granularities and modalities for efficient multimodal retrieval. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 24545--24563

  39. [39]

    Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Yuxuan Zhao, Zehua Xie, and 1 others. 2024. mR2AG: Multimodal retrieval-reflection-augmented generation for knowledge-based vqa. arXiv preprint arXiv:2411.15041

  40. [40]

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675

  41. [41]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, and 1 others. 2025. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176

  42. [42]

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)

  43. [43]

    Yingjian Zhu, Ying Wang, Yuyang Hong, Ruohao Guo, Kun Ding, Xin Gu, Bin Fan, and Shiming Xiang. 2026. Seavis: Sound-enhanced association for online audio-visual instance segmentation. arXiv preprint arXiv:2603.01431