Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

Qian Ma; Qiong Wu; Yao Ma; Zhengyi Zhou

arxiv: 2606.23881 · v1 · pith:75DM6RXXnew · submitted 2026-06-22 · 💻 cs.CL · cs.CV· cs.IR

Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

Qian Ma , Qiong Wu , Zhengyi Zhou , Yao Ma This is my paper

Pith reviewed 2026-06-26 08:10 UTC · model grok-4.3

classification 💻 cs.CL cs.CVcs.IR

keywords knowledge-based VQAentity identificationtraining-free methodsmultimodal retrievalevidence rankingMLLM promptingvisual question answering

0 comments

The pith

Decoupling entity identification from evidence ranking improves knowledge-based visual question answering without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing multimodal retrieval methods for knowledge-based visual QA combine entity discrimination and evidence ranking into one costly stage, which limits performance and requires fine-tuning. It observes that multimodal models handle entity selection more reliably when given a short list of candidate names than when generating names from scratch. The proposed identify-before-answer workflow therefore prompts a multimodal model only for entity choice, then hands the fixed entity to an ordinary text re-ranker for evidence selection. Experiments on Encyclopedic-VQA and InfoSeek show this separation raises accuracy over fine-tuned baselines while eliminating training and lowering inference cost. The gains come from both more accurate entities and, once the entity is fixed, more informative evidence sections.

Core claim

A training-free identify-before-answer framework first prompts an MLLM to pick the correct entity from a small candidate set and then applies an off-the-shelf textual re-ranker for evidence; this decoupled workflow outperforms fine-tuned multimodal re-ranking baselines on Encyclopedic-VQA and InfoSeek because it exploits the model strength in selection tasks and yields both better entity grounding and higher-quality evidence once the entity is known.

What carries the argument

The identify-before-answer (IBA) workflow that isolates MLLM-based entity selection from candidate names before textual evidence re-ranking.

If this is right

Fixing the entity first produces measurably more informative evidence sections than joint multimodal ranking.
Removing the need to fine-tune multimodal models reduces both training compute and inference latency.
The separation makes the pipeline easier to adapt to new knowledge sources without retraining.
Entity-level and fact-level grounding emerge as distinct bottlenecks that can be addressed independently.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same candidate-selection trick could be tested on other grounding tasks where models struggle with open generation but succeed at multiple-choice selection.
Automatically constructing the candidate entity lists from the image or question alone would remove dependence on external knowledge bases.
Scaling the method to larger or noisier knowledge collections would reveal whether the decoupling remains stable when candidate sets grow.

Load-bearing premise

Multimodal models identify entities more reliably when choosing from a small set of candidate names than when generating names without constraints.

What would settle it

A head-to-head test on the same queries showing that the MLLM performs no better at candidate selection than at open-ended naming, or a run of the full IBA pipeline that fails to exceed the fine-tuned multimodal baselines on Encyclopedic-VQA or InfoSeek.

Figures

Figures reproduced from arXiv: 2606.23881 by Qian Ma, Qiong Wu, Yao Ma, Zhengyi Zhou.

**Figure 2.** Figure 2: Qualitative result E-VQA, where we compare the answers provided by IBA with EchoSight ( [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Sample qualitative results on image-question pairs from InfoSeek, where we compare the answers provided [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: Sample qualitative results on image-question pairs from E-VQA, where we compare the answers provided [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Sample qualitative results on image-question pairs from E-VQA, where we compare the answers provided [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Error cases on image-question pairs from E-VQA, where we compare the answers provided by our [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

read the original abstract

Knowledge-Based Visual Question Answering (KB-VQA) requires grounding visual queries to external knowledge beyond directly observable content in images. While recent multi modal large language models (MLLMs) show strong perceptual abilities, they struggle on KB-VQA tasks requiring groundings from both fine-grained entity and evidence levels. Most existing multi-modal retrieval augmented generation (MM-RAG) methods tightly couple entity discrimination and section-level evidence ranking into a single re-ranking stage, leading to high cost and limited generalization. In this work, we revisit existing MM-RAG solutions from a workflow perspective and argue both entity-level and fact-level groundings are key bottlenecks. We observe that although MLLMs often fail under open-ended entity naming, they can better identify the correct entity when selecting from a small set of candidate names. Based on this insight, we propose a simple and training-free identify-before-answer IBA framework that decouples entity identification from section-level re-ranking. Our approach prompts an MLLM to select high-confidence entities using only candidate names, followed by an off-the-shelf textual re-ranker for evidence selection. Experiments on Encyclopedic-VQA and InfoSeek show that our method consistently outperforms fine-tuned multi-modal re-ranking baselines while reducing training and inference complexity. Additional analyses reveal that the improvements arise not only from better entity identification, but also from selecting more informative evidence once correct entity is fixed. Our implementation is made public to ease reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The decoupling of entity selection from evidence ranking is a practical workflow tweak that beats the fine-tuned baselines on the two datasets, but the training-free claim rests on an unstated candidate-generation step.

read the letter

The main thing here is they split the MM-RAG process for KB-VQA so an MLLM only has to pick the right entity name from a short list, then an ordinary text re-ranker handles the evidence. That matches their point that MLLMs are better at selection than open-ended naming, and the experiments on Encyclopedic-VQA and InfoSeek show the split beats fine-tuned multimodal rerankers while cutting training and inference cost. They also release the code, which is useful.

What works is the empirical demonstration that fixing the entity first improves the downstream evidence selection too. The workflow is simple and directly addresses a real bottleneck in these tasks.

The soft spot is candidate sourcing. The abstract never says how the small set of names is produced. If that step uses a trained retriever or KB lookup, the training-free label and the comparison to other methods both get weaker, exactly as the stress-test note flags. Without that detail the central advantage is hard to judge.

This is a narrow but usable paper for people already working on knowledge-based VQA or lightweight multimodal retrieval. It does not claim to shift broader methods, just offers a cleaner pipeline on these datasets.

I would send it to peer review so the candidate step and the actual numbers can be examined.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a training-free Identify-Before-Answer (IBA) framework for Knowledge-Based Visual Question Answering (KB-VQA) that decouples entity identification from evidence re-ranking. An MLLM is prompted to select the correct entity from a small set of candidate names; an off-the-shelf textual re-ranker then selects evidence sections. The authors claim this workflow consistently outperforms fine-tuned multi-modal re-ranking baselines on Encyclopedic-VQA and InfoSeek, reduces training and inference complexity, and that gains arise from both improved entity grounding and better evidence selection once the entity is fixed. The implementation is released publicly.

Significance. If the empirical claims hold after clarification of the workflow, the result would be significant for KB-VQA and MM-RAG research: it demonstrates that a simple decoupled pipeline can surpass integrated fine-tuned multi-modal models while lowering cost, and it isolates the contribution of entity-level versus evidence-level grounding. The public code release is a clear strength that supports reproducibility and follow-up work.

major comments (2)

[§3] §3 (Method): The procedure for obtaining the candidate entity names supplied to the MLLM is not specified. This step is load-bearing for the central claim of a training-free workflow and for the asserted reduction in complexity relative to fine-tuned multi-modal re-rankers; if candidate generation itself requires a trained retriever or non-trivial KB lookup, the decoupling advantage and the fairness of the baseline comparisons become unclear.
[§4] §4 (Experiments): The abstract asserts 'consistent outperformance' and attributes gains to both entity and evidence improvements, yet the provided description supplies no quantitative numbers, baseline details, statistical tests, or error analysis. Without these, the strength of the empirical support for the IBA framework cannot be assessed.

minor comments (2)

The abstract would be strengthened by including one or two key performance numbers (e.g., accuracy deltas on each dataset) to substantiate the outperformance claim.
Notation for the IBA stages (entity selection vs. re-ranking) should be introduced once and used consistently in the method and analysis sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [§3] §3 (Method): The procedure for obtaining the candidate entity names supplied to the MLLM is not specified. This step is load-bearing for the central claim of a training-free workflow and for the asserted reduction in complexity relative to fine-tuned multi-modal re-rankers; if candidate generation itself requires a trained retriever or non-trivial KB lookup, the decoupling advantage and the fairness of the baseline comparisons become unclear.

Authors: We agree that the candidate generation procedure must be specified explicitly for the training-free claim to be fully evaluable. In the revised manuscript we will expand §3 with a precise description of how the small set of candidate names is produced (a lightweight, training-free entity linking step over the question text that does not rely on learned retrievers). This addition will also clarify why the subsequent MLLM selection and off-the-shelf re-ranker remain decoupled and comparable to the fine-tuned baselines. revision: yes
Referee: [§4] §4 (Experiments): The abstract asserts 'consistent outperformance' and attributes gains to both entity and evidence improvements, yet the provided description supplies no quantitative numbers, baseline details, statistical tests, or error analysis. Without these, the strength of the empirical support for the IBA framework cannot be assessed.

Authors: The experiments section of the manuscript already reports concrete accuracy numbers on Encyclopedic-VQA and InfoSeek together with baseline comparisons; however, we acknowledge that additional quantitative detail, statistical tests, and error analysis would strengthen the presentation. In the revision we will augment §4 with the requested numbers, explicit baseline configurations, significance tests, and a brief error analysis that isolates entity-level versus evidence-level contributions. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical workflow without derivations or fitted predictions

full rationale

The paper proposes an identify-before-answer workflow for KB-VQA based on an empirical observation about MLLM entity selection from candidates, followed by off-the-shelf re-ranking. No equations, parameters, or mathematical derivations are present. The central claims rest on experimental comparisons to baselines on public datasets (Encyclopedic-VQA, InfoSeek), not on any reduction of outputs to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The method is self-contained as a training-free empirical approach.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, or new postulated entities are introduced; the method rests on an empirical observation about MLLM selection behavior and the availability of off-the-shelf re-rankers.

pith-pipeline@v0.9.1-grok · 5796 in / 1139 out tokens · 27028 ms · 2026-06-26T08:10:20.768886+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

66 extracted references · 7 linked inside Pith

[1]

arXiv preprint arXiv:2407.12735 , year=

EchoSight: Advancing visual-language models with Wiki knowledge , author=. arXiv preprint arXiv:2407.12735 , year=

arXiv
[2]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Augmenting multimodal llms with self-reflective tokens for knowledge-based visual question answering , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
[3]

arXiv preprint arXiv:2506.02544 , year=

CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG , author=. arXiv preprint arXiv:2506.02544 , year=

arXiv
[4]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[5]

Proceedings of the IEEE international conference on computer vision , pages=

Vqa: Visual question answering , author=. Proceedings of the IEEE international conference on computer vision , pages=
[6]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[7]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Vizwiz grand challenge: Answering visual questions from blind people , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[8]

European Conference on Computer Vision , pages=

Captioning images taken by people who are blind , author=. European Conference on Computer Vision , pages=. 2020 , organization=

2020
[9]

Workshop on Demographic Diversity in Computer Vision@ CVPR 2025 , year=

Long-form answers to visual questions from blind and low vision people , author=. Workshop on Demographic Diversity in Computer Vision@ CVPR 2025 , year=

2025
[10]

Proceedings of the 33rd ACM International Conference on Multimedia , pages=

HierMEQA: A Relationship-Aware Hierarchical Framework for Consistent Micro-Expression Visual Question Answering , author=. Proceedings of the 33rd ACM International Conference on Multimedia , pages=
[11]

International Conference on Human-Computer Interaction , pages=

Aug-Creativity: Framework for Human-Centered Creativity with Vision Language Models , author=. International Conference on Human-Computer Interaction , pages=. 2025 , organization=

2025
[12]

Understanding

What's Different between Visual Question Answering for Machine" Understanding" Versus for Accessibility? , author=. arXiv preprint arXiv:2210.14966 , year=

arXiv
[13]

ACM Computing Surveys , volume=

Visual question answering: A survey of methods, datasets, evaluation, and challenges , author=. ACM Computing Surveys , volume=. 2025 , publisher=

2025
[14]

arXiv preprint arXiv:2504.17547 , year=

A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task , author=. arXiv preprint arXiv:2504.17547 , year=

arXiv
[15]

IEEE transactions on pattern analysis and machine intelligence , volume=

Fvqa: Fact-based visual question answering , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2017 , publisher=

2017
[16]

Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , pages=

Ok-vqa: A visual question answering benchmark requiring external knowledge , author=. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , pages=
[17]

European conference on computer vision , pages=

A-okvqa: A benchmark for visual question answering using world knowledge , author=. European conference on computer vision , pages=. 2022 , organization=

2022
[18]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[19]

arXiv preprint arXiv:2302.11713 , year=

Can pre-trained vision and language models answer visual information-seeking questions? , author=. arXiv preprint arXiv:2302.11713 , year=

arXiv
[20]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Benchmarking representation learning for natural world image collections , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[21]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[22]

National Science Review , volume=

A survey on multimodal large language models , author=. National Science Review , volume=. 2024 , publisher=

2024
[23]

2023 IEEE International Conference on Big Data (BigData) , pages=

Multimodal large language models: A survey , author=. 2023 IEEE International Conference on Big Data (BigData) , pages=. 2023 , organization=

2023
[24]

arXiv preprint arXiv:2402.12451 , year=

The revolution of multimodal large language models: a survey , author=. arXiv preprint arXiv:2402.12451 , year=

arXiv
[25]

ACM Computing Surveys , volume=

Natural language understanding and inference with mllm in visual question answering: A survey , author=. ACM Computing Surveys , volume=. 2025 , publisher=

2025
[26]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
[27]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Soft knowledge prompt: Help external knowledge become a better teacher to instruct llm in knowledge-based vqa , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[28]

arXiv preprint arXiv:2411.15041 , year=

mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA , author=. arXiv preprint arXiv:2411.15041 , year=

arXiv
[29]

Advances in Neural Information Processing Systems , volume=

Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering , author=. Advances in Neural Information Processing Systems , volume=
[30]

arXiv preprint arXiv:2402.08327 , year=

Preflmr: Scaling up fine-grained late-interaction multi-modal retrievers , author=. arXiv preprint arXiv:2402.08327 , year=

arXiv
[31]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

MuKA: Multimodal knowledge augmented visual information-seeking , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=
[32]

Advances in neural information processing systems , volume=

Visual instruction tuning , author=. Advances in neural information processing systems , volume=
[33]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[34]

5-vl technical report , author=

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

Pith/arXiv arXiv
[35]

Advances in neural information processing systems , volume=

Flamingo: a visual language model for few-shot learning , author=. Advances in neural information processing systems , volume=
[36]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

2023
[37]

arXiv preprint arXiv:2402.13561 , year=

Cognitive visual-language mapper: Advancing multimodal comprehension with enhanced visual knowledge alignment , author=. arXiv preprint arXiv:2402.13561 , year=

arXiv
[38]

arXiv preprint arXiv:2505.24840 , year=

Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck , author=. arXiv preprint arXiv:2505.24840 , year=

arXiv
[39]

European Conference on Computer Vision , pages=

Grounding language models for visual entity recognition , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024
[40]

Advances in Neural Information Processing Systems , volume=

Web-scale visual entity recognition: An LLM-driven data approach , author=. Advances in Neural Information Processing Systems , volume=
[41]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

A generative approach for wikipedia-scale visual entity recognition , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[42]

arXiv preprint arXiv:2402.04252 , year=

Eva-clip-18b: Scaling clip to 18 billion parameters , author=. arXiv preprint arXiv:2402.04252 , year=

arXiv
[43]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

2021
[44]

IEEE Transactions on Big Data , year=

The faiss library , author=. IEEE Transactions on Big Data , year=
[45]

arXiv preprint arXiv:2010.11929 , year=

An image is worth 16x16 words: Transformers for image recognition at scale , author=. arXiv preprint arXiv:2010.11929 , year=

Pith/arXiv arXiv 2010
[46]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[47]

arXiv preprint arXiv:1904.09675 , year=

Bertscore: Evaluating text generation with bert , author=. arXiv preprint arXiv:1904.09675 , year=

Pith/arXiv arXiv 1904
[48]

tip of the tongue

The “tip of the tongue” phenomenon , author=. Journal of verbal learning and verbal behavior , volume=. 1966 , publisher=

1966
[49]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

Murag: Multimodal retrieval-augmented generator for open question answering over images and text , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

2022
[50]

Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

MRAMG-Bench: A Comprehensive Benchmark for Advancing Multimodal Retrieval-Augmented Multimodal Generation , author=. Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=
[51]

2023 , eprint=

Making Large Language Models A Better Foundation For Dense Retrieval , author=. 2023 , eprint=

2023
[52]

arXiv preprint arXiv:2402.03216 , year=

Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation , author=. arXiv preprint arXiv:2402.03216 , year=

Pith/arXiv arXiv
[53]

arXiv preprint arXiv:2505.15517 , year=

Robo2vlm: Visual question answering from large-scale in-the-wild robot manipulation datasets , author=. arXiv preprint arXiv:2505.15517 , year=

arXiv
[54]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[55]

European Conference on Computer Vision , pages=

Lingoqa: Visual question answering for autonomous driving , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024
[56]

arXiv preprint arXiv:2504.05288 , year=

LiveVQA: Live Visual Knowledge Seeking , author=. arXiv preprint arXiv:2504.05288 , year=

arXiv
[57]

arXiv preprint arXiv:2409.12959 , year=

Mmsearch: Benchmarking the potential of large models as multi-modal search engines , author=. arXiv preprint arXiv:2409.12959 , year=

arXiv
[58]

arXiv preprint arXiv:2506.20670 , year=

MMSearch-R1: Incentivizing LMMs to Search , author=. arXiv preprint arXiv:2506.20670 , year=

Pith/arXiv arXiv
[59]

arXiv preprint arXiv:2001.08361 , year=

Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=

Pith/arXiv arXiv 2001
[60]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

2019
[61]

ICLR , year=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. ICLR , year=
[62]

arXiv preprint arXiv:1907.11692 , year=

Roberta: A robustly optimized bert pretraining approach , author=. arXiv preprint arXiv:1907.11692 , year=

Pith/arXiv arXiv 1907
[63]

European Conference on Computer Vision , pages=

Three things everyone should know about vision transformers , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022
[64]

Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

Token pooling in vision transformers for image classification , author=. Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=
[65]

Advances in Neural Information Processing Systems , volume=

MG-ViT: a multi-granularity method for compact and efficient vision transformers , author=. Advances in Neural Information Processing Systems , volume=
[66]

2025 , bdsk-url-1 =

Update to. 2025 , bdsk-url-1 =

2025

[1] [1]

arXiv preprint arXiv:2407.12735 , year=

EchoSight: Advancing visual-language models with Wiki knowledge , author=. arXiv preprint arXiv:2407.12735 , year=

arXiv

[2] [2]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Augmenting multimodal llms with self-reflective tokens for knowledge-based visual question answering , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

[3] [3]

arXiv preprint arXiv:2506.02544 , year=

CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG , author=. arXiv preprint arXiv:2506.02544 , year=

arXiv

[4] [4]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[5] [5]

Proceedings of the IEEE international conference on computer vision , pages=

Vqa: Visual question answering , author=. Proceedings of the IEEE international conference on computer vision , pages=

[6] [6]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

[7] [7]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Vizwiz grand challenge: Answering visual questions from blind people , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

[8] [8]

European Conference on Computer Vision , pages=

Captioning images taken by people who are blind , author=. European Conference on Computer Vision , pages=. 2020 , organization=

2020

[9] [9]

Workshop on Demographic Diversity in Computer Vision@ CVPR 2025 , year=

Long-form answers to visual questions from blind and low vision people , author=. Workshop on Demographic Diversity in Computer Vision@ CVPR 2025 , year=

2025

[10] [10]

Proceedings of the 33rd ACM International Conference on Multimedia , pages=

HierMEQA: A Relationship-Aware Hierarchical Framework for Consistent Micro-Expression Visual Question Answering , author=. Proceedings of the 33rd ACM International Conference on Multimedia , pages=

[11] [11]

International Conference on Human-Computer Interaction , pages=

Aug-Creativity: Framework for Human-Centered Creativity with Vision Language Models , author=. International Conference on Human-Computer Interaction , pages=. 2025 , organization=

2025

[12] [12]

Understanding

What's Different between Visual Question Answering for Machine" Understanding" Versus for Accessibility? , author=. arXiv preprint arXiv:2210.14966 , year=

arXiv

[13] [13]

ACM Computing Surveys , volume=

Visual question answering: A survey of methods, datasets, evaluation, and challenges , author=. ACM Computing Surveys , volume=. 2025 , publisher=

2025

[14] [14]

arXiv preprint arXiv:2504.17547 , year=

A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task , author=. arXiv preprint arXiv:2504.17547 , year=

arXiv

[15] [15]

IEEE transactions on pattern analysis and machine intelligence , volume=

Fvqa: Fact-based visual question answering , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2017 , publisher=

2017

[16] [16]

Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , pages=

Ok-vqa: A visual question answering benchmark requiring external knowledge , author=. Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , pages=

[17] [17]

European conference on computer vision , pages=

A-okvqa: A benchmark for visual question answering using world knowledge , author=. European conference on computer vision , pages=. 2022 , organization=

2022

[18] [18]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[19] [19]

arXiv preprint arXiv:2302.11713 , year=

Can pre-trained vision and language models answer visual information-seeking questions? , author=. arXiv preprint arXiv:2302.11713 , year=

arXiv

[20] [20]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Benchmarking representation learning for natural world image collections , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[21] [21]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[22] [22]

National Science Review , volume=

A survey on multimodal large language models , author=. National Science Review , volume=. 2024 , publisher=

2024

[23] [23]

2023 IEEE International Conference on Big Data (BigData) , pages=

Multimodal large language models: A survey , author=. 2023 IEEE International Conference on Big Data (BigData) , pages=. 2023 , organization=

2023

[24] [24]

arXiv preprint arXiv:2402.12451 , year=

The revolution of multimodal large language models: a survey , author=. arXiv preprint arXiv:2402.12451 , year=

arXiv

[25] [25]

ACM Computing Surveys , volume=

Natural language understanding and inference with mllm in visual question answering: A survey , author=. ACM Computing Surveys , volume=. 2025 , publisher=

2025

[26] [26]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

[27] [27]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Soft knowledge prompt: Help external knowledge become a better teacher to instruct llm in knowledge-based vqa , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[28] [28]

arXiv preprint arXiv:2411.15041 , year=

mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA , author=. arXiv preprint arXiv:2411.15041 , year=

arXiv

[29] [29]

Advances in Neural Information Processing Systems , volume=

Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering , author=. Advances in Neural Information Processing Systems , volume=

[30] [30]

arXiv preprint arXiv:2402.08327 , year=

Preflmr: Scaling up fine-grained late-interaction multi-modal retrievers , author=. arXiv preprint arXiv:2402.08327 , year=

arXiv

[31] [31]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

MuKA: Multimodal knowledge augmented visual information-seeking , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

[32] [32]

Advances in neural information processing systems , volume=

Visual instruction tuning , author=. Advances in neural information processing systems , volume=

[33] [33]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[34] [34]

5-vl technical report , author=

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

Pith/arXiv arXiv

[35] [35]

Advances in neural information processing systems , volume=

Flamingo: a visual language model for few-shot learning , author=. Advances in neural information processing systems , volume=

[36] [36]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

2023

[37] [37]

arXiv preprint arXiv:2402.13561 , year=

Cognitive visual-language mapper: Advancing multimodal comprehension with enhanced visual knowledge alignment , author=. arXiv preprint arXiv:2402.13561 , year=

arXiv

[38] [38]

arXiv preprint arXiv:2505.24840 , year=

Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck , author=. arXiv preprint arXiv:2505.24840 , year=

arXiv

[39] [39]

European Conference on Computer Vision , pages=

Grounding language models for visual entity recognition , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024

[40] [40]

Advances in Neural Information Processing Systems , volume=

Web-scale visual entity recognition: An LLM-driven data approach , author=. Advances in Neural Information Processing Systems , volume=

[41] [41]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

A generative approach for wikipedia-scale visual entity recognition , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[42] [42]

arXiv preprint arXiv:2402.04252 , year=

Eva-clip-18b: Scaling clip to 18 billion parameters , author=. arXiv preprint arXiv:2402.04252 , year=

arXiv

[43] [43]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

2021

[44] [44]

IEEE Transactions on Big Data , year=

The faiss library , author=. IEEE Transactions on Big Data , year=

[45] [45]

arXiv preprint arXiv:2010.11929 , year=

An image is worth 16x16 words: Transformers for image recognition at scale , author=. arXiv preprint arXiv:2010.11929 , year=

Pith/arXiv arXiv 2010

[46] [46]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[47] [47]

arXiv preprint arXiv:1904.09675 , year=

Bertscore: Evaluating text generation with bert , author=. arXiv preprint arXiv:1904.09675 , year=

Pith/arXiv arXiv 1904

[48] [48]

tip of the tongue

The “tip of the tongue” phenomenon , author=. Journal of verbal learning and verbal behavior , volume=. 1966 , publisher=

1966

[49] [49]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

Murag: Multimodal retrieval-augmented generator for open question answering over images and text , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

2022

[50] [50]

Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

MRAMG-Bench: A Comprehensive Benchmark for Advancing Multimodal Retrieval-Augmented Multimodal Generation , author=. Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

[51] [51]

2023 , eprint=

Making Large Language Models A Better Foundation For Dense Retrieval , author=. 2023 , eprint=

2023

[52] [52]

arXiv preprint arXiv:2402.03216 , year=

Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation , author=. arXiv preprint arXiv:2402.03216 , year=

Pith/arXiv arXiv

[53] [53]

arXiv preprint arXiv:2505.15517 , year=

Robo2vlm: Visual question answering from large-scale in-the-wild robot manipulation datasets , author=. arXiv preprint arXiv:2505.15517 , year=

arXiv

[54] [54]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[55] [55]

European Conference on Computer Vision , pages=

Lingoqa: Visual question answering for autonomous driving , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024

[56] [56]

arXiv preprint arXiv:2504.05288 , year=

LiveVQA: Live Visual Knowledge Seeking , author=. arXiv preprint arXiv:2504.05288 , year=

arXiv

[57] [57]

arXiv preprint arXiv:2409.12959 , year=

Mmsearch: Benchmarking the potential of large models as multi-modal search engines , author=. arXiv preprint arXiv:2409.12959 , year=

arXiv

[58] [58]

arXiv preprint arXiv:2506.20670 , year=

MMSearch-R1: Incentivizing LMMs to Search , author=. arXiv preprint arXiv:2506.20670 , year=

Pith/arXiv arXiv

[59] [59]

arXiv preprint arXiv:2001.08361 , year=

Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=

Pith/arXiv arXiv 2001

[60] [60]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

2019

[61] [61]

ICLR , year=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. ICLR , year=

[62] [62]

arXiv preprint arXiv:1907.11692 , year=

Roberta: A robustly optimized bert pretraining approach , author=. arXiv preprint arXiv:1907.11692 , year=

Pith/arXiv arXiv 1907

[63] [63]

European Conference on Computer Vision , pages=

Three things everyone should know about vision transformers , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022

[64] [64]

Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

Token pooling in vision transformers for image classification , author=. Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

[65] [65]

Advances in Neural Information Processing Systems , volume=

MG-ViT: a multi-granularity method for compact and efficient vision transformers , author=. Advances in Neural Information Processing Systems , volume=

[66] [66]

2025 , bdsk-url-1 =

Update to. 2025 , bdsk-url-1 =

2025