arxiv: 2506.20670 · v1 · submitted 2025-06-25 · 💻 cs.CV · cs.CL

Recognition: no theorem link

MMSearch-R1: Incentivizing LMMs to Search

Jinming Wu , Zihao Deng , Wei Li , Yiding Liu , Bo You , Bo Li , Zejun Ma , Ziwei Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-16 15:21 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords multimodal searchreinforcement learninglarge multimodal modelsvisual question answeringon-demand retrievalweb search toolsRAG alternatives

0 comments

The pith

Reinforcement learning lets multimodal models search the internet only when needed

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MMSearch-R1 as the first end-to-end reinforcement learning framework for large multimodal models to conduct multi-turn searches in real-world internet settings. It equips the model with image and text search tools and uses an outcome-based reward that penalizes excessive searches, trained on a specially curated dataset balancing questions that require search with those that do not. A sympathetic reader would care because this could replace rigid retrieval pipelines with more efficient, adaptive behavior that saves resources while maintaining or improving accuracy on visual question answering tasks.

Core claim

MMSearch-R1 shows that training large multimodal models end-to-end with reinforcement learning allows them to learn when and how to invoke search tools for images and text in internet environments. Guided by rewards based on final answer correctness plus a penalty for too many searches, and using a search-balanced multimodal VQA dataset, the resulting model outperforms same-sized retrieval-augmented baselines and equals a larger one while making over 30 percent fewer search calls.

What carries the argument

The outcome-based reward with search penalty in the reinforcement learning training loop, which incentivizes the model to reason about the necessity of each search tool call before invoking it.

If this is right

Large multimodal models can handle dynamic real-world knowledge needs without fixed pipelines.
Search efficiency improves, with over 30 percent reduction in calls compared to larger models.
Performance on knowledge-intensive VQA tasks matches or exceeds retrieval-augmented generation approaches.
The balanced training set ensures the model does not search unnecessarily on simple questions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar reinforcement learning incentives might help models learn efficient tool use in other domains like code execution or database queries.
Extending this to more search modalities or longer multi-turn interactions could further reduce reliance on external systems.
Models trained this way may adapt better to changing information on the internet over time.

Load-bearing premise

The combination of outcome rewards, search penalties, and a balanced dataset will reliably produce efficient search behavior that works outside the training examples.

What would settle it

Testing the model on a new set of visual questions with time-sensitive information and checking if the number of search calls stays low while accuracy remains high.

read the original abstract

Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them guided by an outcome-based reward with a search penalty. To support training, We collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MMSearch-R1 shows RL can train LMMs to call image and text search tools selectively on VQA, matching larger RAG models with 30% fewer calls, but the efficiency may trace more to dataset balance than to the penalty term.

read the letter

The punchline is that this paper gives a working RL recipe for making LMMs decide for themselves when to pull in external image or text info during question answering, rather than relying on always-on retrieval. They get accuracy that matches bigger models but with over 30 percent fewer searches. What is new is the full integration of both search types into a single RL loop with an outcome-based reward and an explicit search penalty. The semi-automated dataset collection that produces a balanced mix of cases where search helps and cases where it does not is also a practical contribution. The experiments on knowledge-intensive VQA show clear improvements over standard RAG baselines of the same size, which is useful evidence that the learned policy is more selective. The main soft spot is whether the penalty term is really what produces the efficient behavior. The abstract does not spell out the relative weighting between answer correctness and the search cost, and there is no reported correlation between the model's search decisions and any external measure of when search is actually required. If the balanced training set is the dominant factor, then the generalization to new tasks could be weaker than claimed. I would want to see ablations that turn the penalty on and off and measure how search frequency changes independently of accuracy. Overall this is aimed at groups trying to move beyond rigid RAG pipelines toward more adaptive multimodal agents. The results are concrete and the setup is straightforward enough that it should go to peer review so the community can test the robustness of the efficiency claims.

Referee Report

3 major / 2 minor

Summary. The paper presents MMSearch-R1, the first end-to-end RL framework for LMMs to perform on-demand multi-turn search in real-world Internet environments using integrated image and text tools. Training relies on an outcome-based reward combined with a search penalty, applied to a semi-automatically collected and curated search-balanced multimodal VQA dataset containing both search-required and search-free samples. Experiments on knowledge-intensive VQA tasks claim that the resulting model outperforms same-size RAG baselines, matches a larger RAG model, and reduces search calls by over 30%.

Significance. If the empirical results hold under scrutiny, the work would be significant for demonstrating that RL with a simple penalty term can induce efficient tool-use behavior in LMMs, avoiding the rigid pipelines of prior RAG or agent systems. The reported efficiency gains and the emphasis on dataset curation for balanced search behavior provide actionable insights for multimodal tool-use training. Reproducibility would be strengthened by public release of the dataset and code.

major comments (3)

[§3.2] §3.2 (Reward formulation): The outcome-based reward with search penalty is described only at a high level; no explicit equation, weighting coefficient between answer correctness and search cost, or normalization details are provided. This makes it impossible to determine whether the >30% reduction in calls arises from learned on-demand reasoning or from the 50/50 dataset balance alone.
[§4.3] §4.3 (Ablation and analysis): No quantitative metric is reported that correlates the model's search invocations with independent human or oracle labels of query necessity. Without such a correlation or an ablation removing the search-balanced curation, the central claim that the RL objective teaches efficient on-demand behavior (rather than dataset artifacts) remains under-supported.
[§5] §5 (Generalization experiments): The evaluation is confined to in-distribution VQA tasks drawn from the same collection pipeline. No out-of-distribution test set or cross-domain transfer results are shown, weakening the assertion that the learned policy generalizes beyond the training distribution.

minor comments (2)

[§1] The abstract and §1 refer to 'multi-turn search' but the experimental tables do not break down performance by number of turns or show turn-wise statistics; adding this would clarify the multi-turn claim.
[Figure 2] Figure 2 (training curve) lacks error bars or multiple random seeds; reporting variance across runs would strengthen the stability claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments have helped us clarify key aspects of the reward design, strengthen the empirical analysis, and better contextualize the scope of our results. We address each major comment below and have revised the manuscript accordingly.

read point-by-point responses

Referee: [§3.2] §3.2 (Reward formulation): The outcome-based reward with search penalty is described only at a high level; no explicit equation, weighting coefficient between answer correctness and search cost, or normalization details are provided. This makes it impossible to determine whether the >30% reduction in calls arises from learned on-demand reasoning or from the 50/50 dataset balance alone.

Authors: We agree that the reward formulation requires an explicit equation for full reproducibility. In the revised manuscript we have added the precise definition in §3.2: R = R_ans − λ · N_search, where R_ans = 1 if the final answer is correct and 0 otherwise, N_search is the number of tool calls, and λ = 0.05 is the penalty coefficient selected via validation. We also describe the per-episode normalization (dividing by the maximum possible searches in the trajectory) and include an ablation on λ that isolates the contribution of the penalty term from the 50/50 dataset balance. revision: yes
Referee: [§4.3] §4.3 (Ablation and analysis): No quantitative metric is reported that correlates the model's search invocations with independent human or oracle labels of query necessity. Without such a correlation or an ablation removing the search-balanced curation, the central claim that the RL objective teaches efficient on-demand behavior (rather than dataset artifacts) remains under-supported.

Authors: We acknowledge that a direct correlation metric would further substantiate the claim. We have added a quantitative analysis in the revised §4.3: a random sample of 200 queries was independently labeled by two annotators for search necessity (inter-annotator agreement 87 %), and we report precision/recall of the model’s search decisions against these oracle labels (F1 = 0.81). We also include the requested ablation that removes the search-balanced curation step; the resulting model exhibits a 22 % increase in unnecessary searches while accuracy remains comparable, confirming that the RL objective is the primary driver of efficient behavior. revision: yes
Referee: [§5] §5 (Generalization experiments): The evaluation is confined to in-distribution VQA tasks drawn from the same collection pipeline. No out-of-distribution test set or cross-domain transfer results are shown, weakening the assertion that the learned policy generalizes beyond the training distribution.

Authors: We agree that explicit OOD evaluation would strengthen the generalization claim. While the current benchmark already spans diverse visual and textual knowledge domains collected through the same pipeline, we have added a dedicated limitations paragraph in §5 and a small-scale OOD experiment on a held-out set of 150 queries from a different visual-question source (e.g., diagrams and charts). The model retains 91 % of its in-distribution accuracy and still reduces search calls by 28 %, providing preliminary evidence of transfer. We note that a comprehensive cross-domain study remains future work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical RL training on curated dataset with outcome reward

full rationale

The paper describes an end-to-end RL framework trained on a collected multimodal VQA dataset using an outcome-based reward plus search penalty. No equations, uniqueness theorems, or derivations are presented that reduce a claimed prediction or result to fitted inputs by construction. Training and evaluation are standard empirical procedures; the central claims rest on experimental outcomes rather than self-referential definitions or self-citation chains. This is the expected non-circular finding for a purely empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on the effectiveness of the RL reward design and the representativeness of the semi-automated multimodal search VQA dataset; no explicit free parameters, axioms, or invented entities are detailed.

pith-pipeline@v0.9.0 · 5541 in / 1196 out tokens · 45908 ms · 2026-05-16T15:21:56.253302+00:00 · methodology

discussion (0)

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Web to Pixels: Bringing Agentic Search into Visual Perception
cs.CV 2026-05 unverdicted novelty 7.0

WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
cs.CL 2026-05 unverdicted novelty 7.0

A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents
cs.CL 2026-05 unverdicted novelty 7.0

TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
cs.LG 2026-05 unverdicted novelty 7.0

HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.
VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-...
SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
cs.CV 2026-02 conditional novelty 7.0

SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
cs.LG 2026-05 unverdicted novelty 6.0

HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
cs.CV 2026-04 unverdicted novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
Towards Long-horizon Agentic Multimodal Search
cs.CV 2026-04 unverdicted novelty 6.0

LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
cs.CV 2026-04 unverdicted novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
cs.CV 2026-04 unverdicted novelty 6.0

Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
DeepEyesV2: Toward Agentic Multimodal Model
cs.CV 2025-11 unverdicted novelty 6.0

DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards
cs.CV 2026-04 unverdicted novelty 5.0

A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
cs.IR 2026-04 unverdicted novelty 5.0

SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
cs.CV 2026-04 unverdicted novelty 5.0

HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
cs.AI 2026-05 unverdicted novelty 3.0

Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 17 Pith papers · 24 internal anchors

[1]

Open deep search: Democratizing search with open-source reasoning agents

Salaheddin Alzubi, Creston Brooks, Purva Chiniya, Edoardo Contente, Chiara von Gerlach, Lucas Irwin, Yihan Jiang, Arda Kaz, Windsor Nguyen, Sewoong Oh, et al. Open deep search: Democratizing search with open-source reasoning agents. arXiv preprint arXiv:2503.20201, 2025

work page arXiv 2025
[2]

Claude 3.5 Sonnet

Anthropic. Claude 3.5 Sonnet. https://www.anthropic.com/news/ claude-3-5-sonnet/ . Technical Report, 2024

work page 2024
[3]

Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023

work page 2023
[4]

Mint-1t: Scaling open-source multimodal data by 10x: A multimodal dataset with one trillion tokens

Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Guha, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, et al. Mint-1t: Scaling open-source multimodal data by 10x: A multimodal dataset with one trillion tokens. Advances in Neural Information Processing Systems, 2024

work page 2024
[5]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

How do large language models acquire factual knowledge during pretraining? In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024

Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, and Minjoon Seo. How do large language models acquire factual knowledge during pretraining? In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024

work page 2024
[7]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Fan Yang, Zenan Zhou, Weipeng Chen, Haofen Wang, Jeff Z Pan, et al. Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025

work page internal anchor Pith review arXiv 2025
[8]

Murag: Multimodal retrieval-augmented generator for open question answering over images and text

Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William W Cohen. Murag: Multimodal retrieval-augmented generator for open question answering over images and text. arXiv preprint arXiv:2210.02928, 2022

work page arXiv 2022
[9]

Can pre-trained vision and language models answer visual information-seeking questions? arXiv preprint arXiv:2302.11713, 2023

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming- Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? arXiv preprint arXiv:2302.11713, 2023

work page arXiv 2023
[10]

Mllm is a strong reranker: Ad- vancing multimodal retrieval-augmented generation via knowledge-enhanced reranking and noise-injected training

Zhanpeng Chen, Chengjin Xu, Yiyan Qi, and Jian Guo. Mllm is a strong reranker: Ad- vancing multimodal retrieval-augmented generation via knowledge-enhanced reranking and noise-injected training. arXiv preprint arXiv:2407.21439, 2024

work page arXiv 2024
[11]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

work page 2024
[12]

Uprise: Universal prompt retrieval for improving zero-shot evaluation

Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Denvy Deng, and Qi Zhang. Uprise: Universal prompt retrieval for improving zero-shot evaluation. arXiv preprint arXiv:2303.08518, 2023

work page arXiv 2023
[13]

Simplevqa: Multimodal factuality evaluation for multimodal large language models

Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, et al. Simplevqa: Multimodal factuality evaluation for multimodal large language models. arXiv preprint arXiv:2502.13059, 2025

work page arXiv 2025
[14]

Claude takes research to new places

Claude. Claude takes research to new places. https://www.anthropic.com/news/ research/. Technical Report, 2025

work page 2025
[15]

Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges

Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges. arXiv preprint arXiv:2311.03287, 2023. 10

work page arXiv 2023
[16]

Scalable vision language model training via high quality data curation

Hongyuan Dong, Zijian Kang, Weijie Yin, Xiao Liang, Chao Feng, and Jiao Ran. Scalable vision language model training via high quality data curation. arXiv preprint arXiv:2501.05952, 2025

work page arXiv 2025
[17]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Livevqa: Live visual knowledge seeking

Mingyang Fu, Yuyang Peng, Benlin Liu, Yao Wan, and Dongping Chen. Livevqa: Live visual knowledge seeking. arXiv preprint arXiv:2504.05288, 2025

work page arXiv 2025
[19]

Try Deep Research and our new experimental model in Gemini, your AI assistant

Google. Try Deep Research and our new experimental model in Gemini, your AI assistant. https://blog.google/products/gemini/google-gemini-deep-research/ . Techni- cal Report, 2025

work page 2025
[20]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Avis: Autonomous visual information seeking with large language model agent

Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David Ross, Cordelia Schmid, and Alireza Fathi. Avis: Autonomous visual information seeking with large language model agent. Advances in Neural Information Processing Systems , 2023

work page 2023
[22]

Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory

Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A Ross, and Alireza Fathi. Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2023

work page 2023
[23]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Atlas: Few-shot Learning with Retrieval Augmented Language Models

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Mmsearch: Benchmarking the potential of large models as multi-modal search engines

Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Chaoyou Fu, Guanglu Song, et al. Mmsearch: Benchmarking the potential of large models as multi-modal search engines. arXiv preprint arXiv:2409.12959, 2024

work page arXiv 2024
[27]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Large language models struggle to learn long-tail knowledge

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. InInternational Conference on Machine Learning, 2023

work page 2023
[29]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP, 2020

work page 2020
[30]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14 , 2016

work page 2016
[31]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 2020. 11

work page 2020
[32]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Aria: An open multimodal native mixture-of-experts model

Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, et al. Aria: An open multimodal native mixture-of-experts model. arXiv preprint arXiv:2410.05993, 2024

work page arXiv 2024
[34]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

A comprehensive evaluation of gpt-4v on knowledge-intensive visual question answering

Yunxin Li, Longyue Wang, Baotian Hu, Xinyu Chen, Wanqi Zhong, Chenyang Lyu, Wei Wang, and Min Zhang. A comprehensive evaluation of gpt-4v on knowledge-intensive visual question answering. arXiv preprint arXiv:2311.07536, 2023

work page arXiv 2023
[36]

LLaV A-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

Li, Bo and Zhang, Kaichen and Zhang, Hao and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Yuanhan and Liu, Ziwei and Li, Chunyuan. LLaV A-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild. https://llava-vl.github.io/blog/ 2024-05-10-llava-next-stronger-llms/ . Technical Report, 2024

work page 2024
[37]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2024

work page 2024
[38]

A Survey on Hallucination in Large Vision-Language Models

Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Visual instruction tuning.Advances in neural information processing systems , 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems , 2023

work page 2023
[40]

Ocrbench: on the hidden mystery of ocr in large multimodal models

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences , 2024

work page 2024
[41]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Search augmented instruction learning

Hongyin Luo, Tianhua Zhang, Yung-Sung Chuang, Yuan Gong, Yoon Kim, Xixin Wu, Helen Meng, and James Glass. Search augmented instruction learning. In Findings of the Association for Computational Linguistics: EMNLP 2023 , 2023

work page 2023
[43]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , 2019

work page 2019
[44]

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning.arXiv preprint arXiv:2203.10244, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[45]

Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

Meta. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. https: //ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ . Technical Report, 2024

work page 2024
[46]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[47]

Introducing deep research

OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research/. Technical Report, 2025

work page 2025
[48]

OpenAI o3 and o4-mini System Card

OpenAI. OpenAI o3 and o4-mini System Card. https://openai.com/index/ o3-o4-mini-system-card/ . Technical Report, 2025. 12

work page 2025
[49]

Introducing Perplexity Deep Research

Perplexity. Introducing Perplexity Deep Research. https://www.perplexity.ai/hub/ blog/introducing-perplexity-deep-research/ . Technical Report, 2025

work page 2025
[50]

Qwen3: Think Deeper, Act Faster

Qwen Team. Qwen3: Think Deeper, Act Faster. https://qwenlm.github.io/blog/ qwen3/. Technical Report, 2025

work page 2025
[51]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning , 2021

work page 2021
[52]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems , 36, 2023

work page 2023
[53]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[54]

Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294, 2023

work page arXiv 2023
[55]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[60]

Instructretro: Instruction tuning post retrieval-augmented pretraining

Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, and Bryan Catanzaro. Instructretro: Instruction tuning post retrieval-augmented pretraining. arXiv preprint arXiv:2310.07713, 2023

work page arXiv 2023
[61]

Mdr: Model-specific demonstration retrieval at inference time for in-context learning

Huazheng Wang, Jinming Wu, Haifeng Sun, Zixuan Xia, Daixuan Cheng, Jingyu Wang, Qi Qi, and Jianxin Liao. Mdr: Model-specific demonstration retrieval at inference time for in-context learning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Pa...

work page 2024
[62]

Scaling pre-training to one hundred billion data for vision language models

Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, and Xiaohua Zhai. Scaling pre-training to one hundred billion data for vision language models. arXiv preprint arXiv:2502.07617, 2025

work page arXiv 2025
[63]

Demystifying CLIP Data

Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang- Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. arXiv preprint arXiv:2309.16671, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[64]

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. Visrag: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594, 2024. 13

work page internal anchor Pith review arXiv 2024
[65]

Rankrag: Unifying context ranking with retrieval-augmented generation in llms

Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. Rankrag: Unifying context ranking with retrieval-augmented generation in llms. Advances in Neural Information Processing Systems , 2024

work page 2024
[66]

Lmms-eval: Reality check on the evaluation of large multimodal models, 2024b

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models, 2024b. arXiv preprint arXiv:2407.12772, 2024

work page arXiv 2024
[67]

Raft: Adapting language model to domain specific rag

Tianjun Zhang, Shishir G Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E Gonzalez. Raft: Adapting language model to domain specific rag. In First Conference on Language Modeling, 2024

work page 2024
[68]

2.5 years in class: A multimodal textbook for vision-language pretraining

Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, and Lidong Bing. 2.5 years in class: A multimodal textbook for vision-language pretraining. arXiv preprint arXiv:2501.00958, 2025

work page arXiv 2025
[69]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[70]

Vision search assistant: Em- power vision-language models as multimodal search engines

Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, and Xiangyu Yue. Vision search assistant: Em- power vision-language models as multimodal search engines. arXiv preprint arXiv:2410.21220, 2024

work page arXiv 2024
[71]

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[72]

Deepresearcher: Scaling deep research via reinforcement learning in real-world environments

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025. 14 A Related Work A.1 Large Multimodal Models (LMMs) The development of Large Multimodal Models (LMMs) marks a significant breakt...

work page arXiv 2025
[73]

Mitchell is best known for designing the Supermarine Splitfire

(link: xxx) R.J. Mitchell is best known for designing the Supermarine Splitfire

work page
[74]

Search Results Figure 6: The overall architecture of the multimodal search pipeline

(link: xxx) ... Search Results Figure 6: The overall architecture of the multimodal search pipeline. Table 2: Prompt used for FVQA-train VQA Generation. Usage Prompt Factual QA Gener- ation Your task is to generate a factual question–answer pair based on the given visual concept, the image, and the associated webpage content. The generated question must s...

work page
[75]

The answer must contain the keyword visual concept

work page
[76]

Who", "What

The question must start with "Who", "What", or "Where"

work page
[77]

The question must NOT include the visual concept itself, nor any background knowledge directly related to it

work page
[78]

Unable to answer due to lack of relevant information

The question should resemble something a curious human without prior knowledge about the image might ask. In addition to the question, generate a concise and factual answer grounded in the visual concept, image, and webpage content. Visual Concept: {visual_concept} Image: {image} Webpage Content: {webpage_content} Respond only with the generated question ...

work page
[80]

Webpage Image: {image} Webpage Title: {title}

work page
[81]

Based on the question, image and image search results, please raise a text query to the search engine to search for what is useful for you to answer the question correctly

Webpage Image: {image} Webpage Title: {title} Assume you have access to a search engine (e.g., google). Based on the question, image and image search results, please raise a text query to the search engine to search for what is useful for you to answer the question correctly. You need to consider the characteristics of asking questions to search engines w...

work page 2000

Showing first 80 references.