hub

Can pre-trained vision and language models answer visual information-seeking questions?

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, Ming-Wei Chang · 2023 · arXiv 2302.11713

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 1

citation-polarity summary

background 1

representative citing papers

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

cs.CV · 2026-03-10 · unverdicted · novelty 7.0

WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

Evaluating the Search Agent in a Parallel World

cs.AI · 2026-03-05 · unverdicted · novelty 7.0

Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.

MMSearch-R1: Incentivizing LMMs to Search

cs.CV · 2025-06-25 · unverdicted · novelty 7.0

MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.

DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

A decision-based agent for KB-VQA learns to dynamically select retrieval or answer actions over multiple steps and achieves state-of-the-art results on InfoSeek and E-VQA after fine-tuning on automatically collected trajectories.

DeepEyesV2: Toward Agentic Multimodal Model

cs.CV · 2025-11-07 · unverdicted · novelty 6.0

DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

cs.IR · 2025-08-07 · unverdicted · novelty 6.0

WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.

Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation

cs.CL · 2025-05-28 · unverdicted · novelty 6.0

MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.

ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

cs.CV · 2026-04-22 · unverdicted · novelty 5.0

A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.

Delineating Knowledge Boundaries for Honest Large Vision-Language Models

cs.CV · 2026-04-29 · unverdicted · novelty 4.0

VLMs fine-tuned on a consistency-probed Visual-Idk dataset via SFT and preference optimization raise truthful rate from 57.9% to 67.3% and show internal evidence of genuine boundary recognition.

PaLI-X: On Scaling up a Multilingual Vision and Language Model

cs.CV · 2023-05-29 · unverdicted · novelty 4.0

Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.

R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation

cs.CV · 2026-01-25

citing papers explorer

Showing 12 of 12 citing papers.

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition cs.CV · 2026-03-10 · unverdicted · none · ref 6
WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.
Evaluating the Search Agent in a Parallel World cs.AI · 2026-03-05 · unverdicted · none · ref 4
Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.
MMSearch-R1: Incentivizing LMMs to Search cs.CV · 2025-06-25 · unverdicted · none · ref 9
MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents cs.CV · 2026-04-21 · unverdicted · none · ref 36
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering cs.CV · 2026-04-08 · unverdicted · none · ref 1
A decision-based agent for KB-VQA learns to dynamically select retrieval or answer actions over multiple steps and achieves state-of-the-art results on InfoSeek and E-VQA after fine-tuning on automatically collected trajectories.
DeepEyesV2: Toward Agentic Multimodal Model cs.CV · 2025-11-07 · unverdicted · none · ref 7
DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent cs.IR · 2025-08-07 · unverdicted · none · ref 5
WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.
Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation cs.CL · 2025-05-28 · unverdicted · none · ref 12
MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards cs.CV · 2026-04-22 · unverdicted · none · ref 3
A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
Delineating Knowledge Boundaries for Honest Large Vision-Language Models cs.CV · 2026-04-29 · unverdicted · none · ref 2
VLMs fine-tuned on a consistency-probed Visual-Idk dataset via SFT and preference optimization raise truthful rate from 57.9% to 67.3% and show internal evidence of genuine boundary recognition.
PaLI-X: On Scaling up a Multilingual Vision and Language Model cs.CV · 2023-05-29 · unverdicted · none · ref 48
Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.
R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation cs.CV · 2026-01-25 · unreviewed · ref 17

Can pre-trained vision and language models answer visual information-seeking questions?

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer