Can pre-trained vision and language models answer visual information-seeking questions?
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: dataset 1
citation-polarity summary: background 1
citing papers explorer
- WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
  WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.
- Evaluating the Search Agent in a Parallel World
  Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.
- MMSearch-R1: Incentivizing LMMs to Search
  MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.
- DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
  DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
- Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering
  A decision-based agent for KB-VQA learns to dynamically select retrieval or answer actions over multiple steps and achieves state-of-the-art results on InfoSeek and E-VQA after fine-tuning on automatically collected trajectories.
- DeepEyesV2: Toward Agentic Multimodal Model
  DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
- WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
  WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.
- Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation
  MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.
- ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards
  A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
- Delineating Knowledge Boundaries for Honest Large Vision-Language Models
  VLMs fine-tuned on a consistency-probed Visual-Idk dataset via SFT and preference optimization raise truthful rate from 57.9% to 67.3% and show internal evidence of genuine boundary recognition.
- PaLI-X: On Scaling up a Multilingual Vision and Language Model
  Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.
- R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation